ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation
read the original abstract
Despite remarkable advances that large language models have achieved in chatbots, maintaining a non-toxic user-AI interactive environment has become increasingly critical nowadays. However, previous efforts in toxicity detection have been mostly based on benchmarks derived from social media content, leaving the unique challenges inherent to real-world user-AI interactions insufficiently explored. In this work, we introduce ToxicChat, a novel benchmark based on real user queries from an open-source chatbot. This benchmark contains the rich, nuanced phenomena that can be tricky for current toxicity detection models to identify, revealing a significant domain difference compared to social media content. Our systematic evaluation of models trained on existing toxicity datasets has shown their shortcomings when applied to this unique domain of ToxicChat. Our work illuminates the potentially overlooked challenges of toxicity detection in real-world user-AI conversations. In the future, ToxicChat can be a valuable resource to drive further advancements toward building a safe and healthy environment for user-AI interactions.
This paper has not been read by Pith yet.
Forward citations
Cited by 13 Pith papers
-
Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems
Adaptive multi-agent LLM pipelines with bandit-based sampling achieve lower false positive rates (0.095 vs 0.159) than single-agent models on two behavioral health datasets while maintaining similar false negative rates.
-
Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics
MTK detects jailbreaks by monitoring the evolution of prompt neighborhood structures on the data manifold through LLM layers, reporting 95% TPR at 5% FPR on benign and 2% on pseudo-malicious prompts plus 85% TPR under...
-
When Youth Enter the Algorithmic Wild: Discovering and Understanding Potentially Harmful Teen Videos on Douyin and Kwai
PHTV-Scout measures 6.11% prevalence of potentially harmful teen videos on Douyin and Kwai (53.2% child sexual exploitation imagery), shows Youth Mode blocks all such content but is used by only 30-41% of teens, and a...
-
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
-
Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs
Interpretability steering attacks succeed at high rates on Llama-3 models but fail on GPT-oss-120B, showing uneven robustness across tested LLMs.
-
LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
-
From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails
Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.
-
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusal rates, achieving SOTA open-source performance and sometimes exceeding GPT-4 while cutti...
-
HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety
HaloGuard 1.0-0.8B achieves the highest average F1 of 90.9 across seven prompt-safety benchmarks among evaluated open guard models while keeping FPR at 4.3 and FNR at 9.5, with a 4B variant reaching 92.1 F1.
-
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
-
FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization
FedDetox uses on-device knowledge-distilled classifiers to sanitize toxic data in federated SLM training, preserving safety alignment comparable to centralized baselines.
-
TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts
TWGuard achieves +0.289 F1 improvement and 94.9% false-positive reduction for LLM safety guardrails in the Taiwan linguistic context compared to foundation models and baselines.
-
ShieldGemma: Generative AI Content Moderation Based on Gemma
ShieldGemma delivers a family of Gemma2-based classifiers that outperform Llama Guard and WildCard on public safety benchmarks while introducing a synthetic-data curation pipeline for safety tasks.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.