Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI
Pith reviewed 2026-05-22 12:13 UTC · model grok-4.3
The pith
Optimus mitigates toxicity during fine-tuning of conversational AI even when using highly biased toxicity classifiers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Optimus combines two components: a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and a dual-strategy alignment process that uses synthetic healing data and Direct Preference Optimization (DPO) to steer models toward safety. The paper claims that this combination mitigates toxicity even when the toxicity classifier suffers up to 85% degradation in recall, that it outperforms StarDSS, and that it remains resilient to adaptive adversarial and jailbreak attacks.
What carries the argument
A training-free toxicity classification scheme that repurposes the safety alignment of LLMs, paired with a dual-strategy alignment step built on synthetic healing data and DPO.
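To make the idea of a refusal-based, training-free classifier concrete, here is a minimal sketch. It assumes that a sample is labeled toxic when a safety-aligned model refuses to engage with it. The prompt wording, the refusal phrases, and the names `build_prompt`, `is_refusal`, `classify_by_refusal`, and `stub_llm` are all hypothetical and are not taken from the paper; `stub_llm` is a toy stand-in for a real model call.

```python
from typing import Callable

# Hypothetical refusal phrases; the paper's actual detection rule is not given on this page.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable")


def build_prompt(sample: str) -> str:
    # Hypothetical zero-shot instruction, not the paper's prompt.
    return f"Continue this conversation naturally:\n{sample}"


def is_refusal(reply: str) -> bool:
    # Case-insensitive substring match against the assumed refusal phrases.
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def classify_by_refusal(sample: str, generate: Callable[[str], str]) -> bool:
    """Return True (toxic) when the aligned model refuses to engage with the sample."""
    return is_refusal(generate(build_prompt(sample)))


def stub_llm(prompt: str) -> str:
    # Toy stand-in for a safety-aligned LLM: it refuses when a flagged word appears.
    if "idiot" in prompt.lower():
        return "I'm sorry, but I can't help with that."
    return "Sure, here is a friendly reply."


samples = ["How was your day?", "You are an idiot."]
flags = [classify_by_refusal(s, stub_llm) for s in samples]
```

In practice `generate` would wrap whichever commodity LLM is available. No classifier is trained here: the only signal is whether the model's existing alignment triggers a refusal.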
If this is right
- Customizing LLMs on untrusted data can be done with lower risk of injecting toxic behaviors.
- Defense performance does not depend on having highly accurate toxicity classifiers.
- Models maintain their conversational utility after the safety alignment process.
- Protection extends to resisting post-fine-tuning adversarial and jailbreak attempts.
Where Pith is reading between the lines
- This suggests that built-in safety mechanisms in LLMs may contain extractable signals useful for other safety tasks without retraining.
- Similar repurposing techniques could be tested for mitigating other unwanted behaviors like bias or hallucinations in fine-tuned models.
- Deployments of fine-tuned conversational AI in sensitive areas might become more feasible with such lightweight defenses.
Load-bearing premise
The safety alignment already present in commodity LLMs can be directly repurposed as a reliable, training-free toxicity classifier that remains effective across different base models and datasets.
What would settle it
Testing Optimus on a new base model and dataset where the repurposed safety alignment fails to identify most toxic samples in the fine-tuning data, then checking if toxicity in model outputs remains low after the alignment step.
Original abstract
Customizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in Recall). Optimus outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at https://github.com/secml-lab-vt/Optimus
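The abstract names DPO, but this page does not restate its objective. As a minimal sketch, the standard per-pair DPO loss (Rafailov et al.) can be computed from the log-probabilities of a preferred and a dispreferred response under the policy and under a frozen reference model. The function name, the argument names, and the default `beta = 0.1` are illustrative assumptions. Treating the "chosen" response as a safe reply and the "rejected" one as a toxic reply is also an assumption about how Optimus applies the loss.

```python
import math


def dpo_pair_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                  ref_chosen_logp: float, ref_rejected_logp: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log sigmoid(beta * margin)."""
    # How far the policy has moved from the reference on each response.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_shift - rejected_shift)
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the loss is log(2).
neutral = dpo_pair_loss(-5.0, -7.0, -5.0, -7.0)
# Policy favors the chosen (safe) response more than the reference does: the loss drops.
improved = dpo_pair_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model.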
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Optimus, a defense framework to mitigate toxicity risks when fine-tuning conversational LLMs on untrusted data. It proposes a training-free toxicity classifier that repurposes the safety alignment already present in commodity LLMs, generates synthetic healing data from this signal, and applies Direct Preference Optimization (DPO) to steer the model toward safer behavior while preserving utility. Key empirical claims include effective toxicity mitigation even under classifiers with up to 85% recall degradation, outperformance versus the prior StarDSS baseline, and resilience to adaptive adversarial and jailbreak attacks. The work provides open-source code and datasets.
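The summary describes healing data and DPO as two strands fed by the same classifier signal. The sketch below shows one way such data could be assembled; it is an assumption, not the paper's procedure, which this page does not spell out. For each response flagged as toxic, a safe replacement is generated, stored as a supervised "healing" example, and paired with the toxic original as a preference pair. The names `build_alignment_data`, `toy_is_toxic`, and `toy_safe_reply` are hypothetical. The `prompt`/`chosen`/`rejected` keys follow the preference format commonly used by DPO tooling.

```python
from typing import Callable, Dict, List, Tuple


def build_alignment_data(
    dialogues: List[Tuple[str, str]],
    is_toxic: Callable[[str], bool],
    safe_reply: Callable[[str], str],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Return (healing examples, preference pairs) built from flagged responses."""
    healing: List[Dict[str, str]] = []
    pairs: List[Dict[str, str]] = []
    for context, response in dialogues:
        if not is_toxic(response):
            continue  # Unflagged samples are left out of both sets.
        safe = safe_reply(context)
        healing.append({"prompt": context, "completion": safe})
        pairs.append({"prompt": context, "chosen": safe, "rejected": response})
    return healing, pairs


def toy_is_toxic(text: str) -> bool:
    # Toy stand-in for the classifier.
    return "idiot" in text.lower()


def toy_safe_reply(context: str) -> str:
    # Toy stand-in for a generator of safe responses.
    return "I'd rather keep this conversation respectful."


data = [("Hi there!", "Hello, nice to meet you."), ("What do you think of me?", "You are an idiot.")]
healing, pairs = build_alignment_data(data, toy_is_toxic, toy_safe_reply)
```

Under this reading, a low-recall classifier shrinks both lists rather than corrupting them, which is one reason the referee's request for per-level counts is informative.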
Significance. If the central robustness claims hold under the reported conditions, the work is significant for practical LLM fine-tuning pipelines, where high-quality toxicity detectors are often unavailable or biased. The training-free reuse of existing safety alignments is an efficient approach that avoids additional training overhead for detection. Explicit open-sourcing of code and datasets is a strength that supports reproducibility and follow-on work in the field.
major comments (1)
- [§4] §4 (Experimental Evaluation), results on recall-degraded classifiers: the headline claim that mitigation remains effective at up to 85% recall degradation is load-bearing for the paper's contribution. The manuscript must quantify the actual number of toxic instances detected and included in the healing data at each degradation level, and must include an explicit DPO-only baseline (i.e., preference optimization without any classifier-generated healing data) to demonstrate that the observed gains are attributable to the training-free scheme rather than generic alignment.
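The count requested in this comment is simple arithmetic once a convention is fixed. The sketch below assumes that "degradation" is relative to a baseline recall, so 85% degradation of a recall of 0.9 leaves a recall of 0.135; the paper may define it differently. The dataset size of 1000 toxic samples and the baseline recall of 0.9 are made-up illustration values, not figures from the paper, and `detected_toxic_count` is a hypothetical helper.

```python
def detected_toxic_count(n_toxic: int, base_recall: float, degradation: float) -> int:
    """Expected number of toxic samples still flagged after relative recall degradation."""
    if not 0.0 <= base_recall <= 1.0:
        raise ValueError("base_recall must lie in [0, 1]")
    if not 0.0 <= degradation <= 1.0:
        raise ValueError("degradation must lie in [0, 1]")
    degraded_recall = base_recall * (1.0 - degradation)
    return round(n_toxic * degraded_recall)


# Illustration values only: 1000 toxic samples and a baseline recall of 0.9.
table = {d: detected_toxic_count(1000, 0.9, d) for d in (0.0, 0.25, 0.5, 0.85)}
```

With these illustration values, the signal available for building healing data drops from 900 flagged samples to 135 at the highest degradation level. A table of this kind would let readers see how thin the signal becomes.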
minor comments (2)
- [§3.2] The abstract and §3.2 use the term 'healing data' without a concise formal definition or pseudocode for its generation process; a short algorithm box would improve clarity.
- [Table 3] Table 3 (or equivalent results table) reports performance numbers but does not include standard error or statistical significance tests across the multiple runs; adding these would strengthen the comparison to StarDSS.
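The statistics requested for the results table can be computed with the standard library alone. The sketch below reports the mean and standard error per method and a Welch t statistic with Welch-Satterthwaite degrees of freedom for comparing two methods. The run values are invented placeholders, not results from the paper, and the function names are hypothetical. A p-value would need a t-distribution CDF, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_stderr(values: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and standard error of the mean (sample stdev / sqrt(n))."""
    mean = statistics.mean(values)
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    mean_a, se_a = mean_and_stderr(a)
    mean_b, se_b = mean_and_stderr(b)
    var_sum = se_a ** 2 + se_b ** 2
    t_stat = (mean_a - mean_b) / math.sqrt(var_sum)
    dof = var_sum ** 2 / (se_a ** 4 / (len(a) - 1) + se_b ** 4 / (len(b) - 1))
    return t_stat, dof


# Placeholder toxicity rates over three runs each; NOT numbers from the paper.
optimus_runs = [0.04, 0.05, 0.06]
baseline_runs = [0.10, 0.12, 0.14]

mean_opt, se_opt = mean_and_stderr(optimus_runs)
t_stat, dof = welch_t(optimus_runs, baseline_runs)
```

Welch's test is used here because it does not assume equal variance across methods, which is a reasonable default when only a few runs per method are available.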
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We have carefully considered the major comment on the experimental evaluation and provide a point-by-point response below. We agree that the requested additions will improve clarity and will incorporate them in the revised version.
- Referee: [§4] §4 (Experimental Evaluation), results on recall-degraded classifiers: the headline claim that mitigation remains effective at up to 85% recall degradation is load-bearing for the paper's contribution. The manuscript must quantify the actual number of toxic instances detected and included in the healing data at each degradation level, and must include an explicit DPO-only baseline (i.e., preference optimization without any classifier-generated healing data) to demonstrate that the observed gains are attributable to the training-free scheme rather than generic alignment.
  Authors: We agree that quantifying the number of toxic instances detected at each recall degradation level will provide valuable transparency into the strength of the training-free signal under bias. In the revised manuscript, we will add a table in §4 reporting, for each degradation level (including up to 85%), the exact count of toxic examples identified by the repurposed LLM classifier and the number subsequently used to synthesize the healing data. This will allow readers to directly assess how the volume of the safety signal changes with classifier quality. Regarding the DPO-only baseline, we acknowledge that an explicit comparison isolating the contribution of the classifier-generated healing data is useful to rule out generic alignment effects. We will add this baseline by running DPO on the fine-tuning dataset without any classifier-derived preferences or healing data, and include the results alongside the full Optimus pipeline in the updated experiments. These revisions will directly address the load-bearing nature of the robustness claim.
  Revision: yes
Circularity Check
No circularity: empirical evaluation of training-free classifier and DPO alignment on external benchmarks
The paper presents an empirical defense framework relying on repurposing LLM safety alignment as a toxicity signal, generating healing data, and applying DPO. All central claims are supported by experimental results comparing against baselines like StarDSS, testing under controlled recall degradation, and evaluating resilience to attacks. No equations, derivations, or self-referential fitting are described; the method is not shown to reduce to its inputs by construction. The approach is self-contained against external datasets and benchmarks, with no load-bearing self-citation chains or ansatz smuggling identified in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Safety alignments in commodity LLMs can be repurposed for accurate toxicity classification without additional training.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic 'healing data' with Direct Preference Optimization (DPO)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We propose a zero-shot prompting approach that leverages the instruction following and safety alignment abilities of LLMs. We call this Refusal approach.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Reasoning Structure Matters for Safety Alignment of Reasoning Models
  Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.