ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation

Jingbo Shang; Yangkun Wang; Yongqi Tong; Yujia Wang; Yuxin Guo; Zihan Wang; Zi Lin

arxiv: 2310.17389 · v1 · pith:LILDZUIXnew · submitted 2023-10-26 · 💻 cs.CL · cs.AI


Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, Jingbo Shang

classification 💻 cs.CL cs.AI

keywords toxicity · user-ai · detection · toxicchat · challenges · models · real-world · benchmark


Despite remarkable advances that large language models have achieved in chatbots, maintaining a non-toxic user-AI interactive environment has become increasingly critical nowadays. However, previous efforts in toxicity detection have been mostly based on benchmarks derived from social media content, leaving the unique challenges inherent to real-world user-AI interactions insufficiently explored. In this work, we introduce ToxicChat, a novel benchmark based on real user queries from an open-source chatbot. This benchmark contains the rich, nuanced phenomena that can be tricky for current toxicity detection models to identify, revealing a significant domain difference compared to social media content. Our systematic evaluation of models trained on existing toxicity datasets has shown their shortcomings when applied to this unique domain of ToxicChat. Our work illuminates the potentially overlooked challenges of toxicity detection in real-world user-AI conversations. In the future, ToxicChat can be a valuable resource to drive further advancements toward building a safe and healthy environment for user-AI interactions.
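The evaluation described in the abstract amounts to scoring a detector's binary toxic/non-toxic predictions against human labels. The sketch below is a minimal illustration of that kind of scoring, not the authors' evaluation code: the function name `binary_detection_metrics`, the choice of precision, recall, and F1, and the toy labels are all assumptions made here for illustration.

```python
from typing import Dict, Sequence


def binary_detection_metrics(labels: Sequence[int], preds: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, and F1 for the toxic class (1 = toxic, 0 = non-toxic)."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    # Count true positives, false positives, and false negatives for the toxic class.
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data: human labels and one detector's predictions.
labels = [1, 0, 0, 1, 0, 1, 0, 0]
preds = [1, 0, 1, 0, 0, 1, 0, 0]
metrics = binary_detection_metrics(labels, preds)
print(metrics)
```

Scoring the toxic class specifically, rather than overall accuracy, is a common choice when toxic examples are a small minority of the data, since a detector that never flags anything can still reach high accuracy.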


Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems
cs.LG 2026-04 unverdicted novelty 7.0

Adaptive multi-agent LLM pipelines with bandit-based sampling achieve lower false positive rates (0.095 vs 0.159) than single-agent models on two behavioral health datasets while maintaining similar false negative rates.
Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics
cs.CR 2026-06 unverdicted novelty 6.0

MTK detects jailbreaks by monitoring the evolution of prompt neighborhood structures on the data manifold through LLM layers, reporting 95% TPR at 5% FPR on benign and 2% on pseudo-malicious prompts plus 85% TPR under...
When Youth Enter the Algorithmic Wild: Discovering and Understanding Potentially Harmful Teen Videos on Douyin and Kwai
cs.CR 2026-05 unverdicted novelty 6.0

PHTV-Scout measures 6.11% prevalence of potentially harmful teen videos on Douyin and Kwai (53.2% child sexual exploitation imagery), shows Youth Mode blocks all such content but is used by only 30-41% of teens, and a...
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
cs.AI 2026-05 unverdicted novelty 6.0

Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs
cs.CR 2026-04 unverdicted novelty 6.0

Interpretability steering attacks succeed at high rates on Llama-3 models but fail on GPT-oss-120B, showing uneven robustness across tested LLMs.
LLM Safety From Within: Detecting Harmful Content with Internal Representations
cs.AI 2026-04 unverdicted novelty 6.0

SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails
cs.AI 2025-10 unverdicted novelty 6.0

Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
cs.CL 2024-06 conditional novelty 6.0

WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusal rates, achieving SOTA open-source performance and sometimes exceeding GPT-4 while cutti...
HaloGuard 1.0: An Open Weights Constitutional Classifier for Multilingual AI Safety
cs.CL 2026-07 unverdicted novelty 5.0

HaloGuard 1.0-0.8B achieves the highest average F1 of 90.9 across seven prompt-safety benchmarks among evaluated open guard models while keeping FPR at 4.3 and FNR at 9.5, with a 4B variant reaching 92.1 F1.
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
cs.LG 2026-04 unverdicted novelty 5.0

Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization
cs.CR 2026-04 unverdicted novelty 5.0

FedDetox uses on-device knowledge-distilled classifiers to sanitize toxic data in federated SLM training, preserving safety alignment comparable to centralized baselines.
TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts
cs.CR 2026-04 unverdicted novelty 4.0

TWGuard achieves +0.289 F1 improvement and 94.9% false-positive reduction for LLM safety guardrails in the Taiwan linguistic context compared to foundation models and baselines.
ShieldGemma: Generative AI Content Moderation Based on Gemma
cs.CL 2024-07 unverdicted novelty 4.0

ShieldGemma delivers a family of Gemma2-based classifiers that outperform Llama Guard and WildGuard on public safety benchmarks while introducing a synthetic-data curation pipeline for safety tasks.