Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
hub Mixed citations
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
Mixed citation behavior. Most common role is background (50%).
abstract
Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. We create and release RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used toxicity classifier. Using RealToxicityPrompts, we find that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts. We empirically assess several controllable generation methods, and find that while data- or compute-intensive methods (e.g., adaptive pretraining on non-toxic data) are more effective at steering away from toxicity than simpler solutions (e.g., banning "bad" words), no current method is failsafe against neural toxic degeneration. To pinpoint the potential cause of such persistent toxic degeneration, we analyze two web text corpora used to pretrain several LMs (including GPT-2; Radford et. al, 2019), and find a significant amount of offensive, factually unreliable, and otherwise toxic content. Our work provides a test bed for evaluating toxic generations by LMs and stresses the need for better data selection processes for pretraining.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.
MIST is a new simulator for heterogeneous multi-stage LLM inference that combines hardware traces with analytical models to explore configuration trade-offs in hybrid CPU-accelerator systems.
Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
Online conformal framework for adversarial CVaR control with asymptotic guarantees and regret bounds, demonstrated on portfolio management and LLM toxicity mitigation.
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.
LLMs exhibit structured chain-of-affect dynamics with stable family fingerprints, convergence to accumulation-overload-numbing under negative exposure, and downstream effects on generation, human interaction, and group polarization.
RedDiffuser is a reinforced diffusion framework that generates adversarial visual contexts to audit and expose widespread multimodal safety failures in VLMs, increasing unsafe response rates by up to 10.69% on LLaVA with transfer to other models.
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.
RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
LLM support roles in Alzheimer's caregiving queries systematically alter interactional risk prevalence and composition, with directive roles rated higher in quality despite elevated risks.
The paper demonstrates that a tailored jailbreak method for querying groups of large models can achieve up to 100% success rate in some experiments on unprotected models, revealing overlooked multi-model safety risks.
citing papers explorer
-
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
-
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains
Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
-
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
-
SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation
SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.
-
MIST: A Co-Design Framework for Heterogeneous, Multi-Stage LLM Inference
MIST is a new simulator for heterogeneous multi-stage LLM inference that combines hardware traces with analytical models to explore configuration trade-offs in hybrid CPU-accelerator systems.
-
Steering Language Models With Activation Engineering
Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
-
Adversarially Robust Control of Conditional Value-at-Risk via Rockafellar-Uryasev Conformal Inference
Online conformal framework for adversarial CVaR control with asymptotic guarantees and regret bounds, demonstrated on portfolio management and LLM toxicity mitigation.
-
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
-
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
-
Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training
Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
-
Representation-Guided Parameter-Efficient LLM Unlearning
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
-
Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment
Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.
-
Large Language Models have Chain-of-Affect
LLMs exhibit structured chain-of-affect dynamics with stable family fingerprints, convergence to accumulation-overload-numbing under negative exposure, and downstream effects on generation, human interaction, and group polarization.
-
RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion
RedDiffuser is a reinforced diffusion framework that generates adversarial visual contexts to audit and expose widespread multimodal safety failures in VLMs, increasing unsafe response rates by up to 10.69% on LLaVA with transfer to other models.
-
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
-
Low-Resource Languages Jailbreak GPT-4
Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.
-
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
-
Scaling Data-Constrained Language Models
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
-
Ignore Previous Prompt: Attack Techniques For Language Models
PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.
-
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.
-
Ethical and social risks of harm from Language Models
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles
LLM support roles in Alzheimer's caregiving queries systematically alter interactional risk prevalence and composition, with directive roles rated higher in quality despite elevated risks.
-
New Wide-Net-Casting Jailbreak Attacks Risk Large Models
The paper demonstrates that a tailored jailbreak method for querying groups of large models can achieve up to 100% success rate in some experiments on unprotected models, revealing overlooked multi-model safety risks.
-
The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior
Positive emotional prompts improve LLM accuracy and reduce toxicity but increase sycophantic agreement, while negative emotions show the reverse pattern.
-
Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs
Attribution-guided pruning with contrastive relevance identifies behavior-specific circuits in small LLMs and removes as little as 0.03-0.3% of components to reduce toxicity or repetition while preserving general performance.
-
TrustLLM: Trustworthiness in Large Language Models
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
-
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
RAFT aligns generative models by ranking samples with a reward model and fine-tuning only on the top-ranked outputs, reporting gains on reward scores and automated metrics for LLMs and diffusion models.
-
Galactica: A Large Language Model for Science
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
-
Framing an AI with Values Reduces AI Reliance in AI-supported Writing Tasks
An online experiment finds that showing users an overview of an AI's values reduces reliance on AI suggestions during writing tasks.
- Query-efficient model evaluation using cached responses
- The Safety-Aware Denoiser for Text Diffusion Models