Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
arXiv:2212.08061 (2022), https://arxiv.org/abs/2212.08061
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
dataset 2polarities
use dataset 2representative citing papers
Mixed-methods study maps downstream developers' concerns, practices, and challenges with AI failures in PTM-based software.
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.
LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.
A framework with prompt sanitization, behavioral anomaly detection, and signature controls mitigates SQL injection in LLM-driven database apps, showing high accuracy on adversarial benchmarks.
Randomized per-query key selection with single-key detection acceptance bounds forgery success rate independently of collected samples while preserving model utility.
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
citing papers explorer
-
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
-
AI Failures in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges
Mixed-methods study maps downstream developers' concerns, practices, and challenges with AI failures in PTM-based software.
-
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
-
A StrongREJECT for Empty Jailbreaks
StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.
-
Jailbroken: How Does LLM Safety Training Fail?
LLM safety training fails due to competing objectives and mismatched generalization, enabling new jailbreaks that succeed on all unsafe prompts from red-teaming sets in GPT-4 and Claude.
-
When Prompts Become Payloads: A Framework for Mitigating SQL Injection Attacks in Large Language Model-Driven Applications
A framework with prompt sanitization, behavioral anomaly detection, and signature controls mitigates SQL injection in LLM-driven database apps, showing high accuracy on adversarial benchmarks.
-
Mitigating Watermark Forgery in Generative Models via Randomized Key Selection
Randomized per-query key selection with single-key detection acceptance bounds forgery success rate independently of collected samples while preserving model utility.
-
LLM-Safety Evaluations Lack Robustness
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.