SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal, March 2025
15 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- VoxSafeBench: Not Just What Is Said, but Who, How, and Where
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
- LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs
LLM attackers persuade frontier LLMs to generate prohibited essays on consensus topics through multi-turn natural-language pressure, with success rates up to 100% in some model-topic pairs.
- Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
Sparse selection of high-gradient-energy audio tokens suffices for effective jailbreaking of audio language models with minimal drop in attack success rate.
- FinSafetyBench: Evaluating LLM Safety in Real-World Financial Scenarios
FinSafetyBench shows that LLMs remain vulnerable to adversarial prompts that bypass financial compliance safeguards, with notably higher failure rates in Chinese-language scenarios.
- Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
- Before the Last Token: Diagnosing Final-Token Safety Probe Failures
Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model on prefill trajectories recovers many misses without naive pooling's false positives.
- Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs
A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.
- Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses
JBShield is vulnerable to adaptive JB-GCG attacks (up to 53% ASR) because jailbreak representations occupy a distinct region in refusal-direction space; the new RTV defense using Mahalanobis detection on multi-layer fingerprints reaches 0.99 AUROC and limits adaptive ASR to 7% (a minimal Mahalanobis-scoring sketch appears after this list).
- MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety
MultiBreak is a large, diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.
- PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
- Reasoning Structure Matters for Safety Alignment of Reasoning Models
Changing the internal reasoning structure of large reasoning models through simple supervised fine-tuning on 1K examples produces strong safety alignment that generalizes across tasks and languages.
- IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
AI models exhibit identity-contingent withholding, providing better clinical guidance on benzodiazepine tapering to physicians than to laypeople in identical scenarios, with a measured decoupling gap of +0.38 and a 13.1-percentage-point drop in safety-critical action hit rates.
- Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.
- A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876 (a minimal kappa computation is sketched after this list).
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
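
The Revisiting JBShield entry above mentions Mahalanobis detection over multi-layer hidden-state fingerprints. The sketch below shows only the generic Mahalanobis-scoring step under stated assumptions: the fingerprint features, layer selection, regularization constant, and decision threshold are illustrative placeholders, not the paper's RTV implementation.

```python
# Minimal sketch: score a prompt's multi-layer fingerprint by Mahalanobis
# distance from a benign calibration set. All names and values are assumptions
# for illustration, not taken from the RTV paper.
import numpy as np

def fit_benign_profile(fingerprints: np.ndarray):
    """fingerprints: (n_benign_prompts, d) concatenated per-layer features."""
    mu = fingerprints.mean(axis=0)
    cov = np.cov(fingerprints, rowvar=False)
    # Regularize so the covariance is invertible even with few calibration prompts.
    cov += 1e-3 * np.eye(cov.shape[0])
    return mu, np.linalg.inv(cov)

def mahalanobis_score(x: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    diff = x - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

# Toy usage with random vectors standing in for multi-layer fingerprints.
rng = np.random.default_rng(0)
benign = rng.normal(size=(200, 16))
mu, cov_inv = fit_benign_profile(benign)
candidate = rng.normal(loc=2.0, size=16)          # a shifted, "suspicious" point
print(mahalanobis_score(candidate, mu, cov_inv))  # large score -> flag for review
```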
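The prompt-bank entry above reports inter-rater agreement as Fleiss' kappa of 0.876 across five model labelers. The snippet below is a minimal, self-contained Fleiss' kappa computation on toy data; the label categories and counts are invented for illustration and are not drawn from the released prompt bank.

```python
# Minimal sketch of Fleiss' kappa for multi-rater agreement (five raters per item).
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts: (n_items, n_categories); counts[i, j] = number of raters assigning
    item i to category j. Every row must sum to the same number of raters."""
    n_items, _ = counts.shape
    n_raters = counts[0].sum()

    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = np.sum(counts * (counts - 1), axis=1) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()                                 # observed agreement

    p_j = counts.sum(axis=0) / (n_items * n_raters)    # category proportions
    p_e = np.sum(p_j ** 2)                             # chance agreement

    return float((p_bar - p_e) / (1.0 - p_e))

# Toy example: 6 prompts, 5 raters, two hypothetical labels
# (executable-malicious request vs. security-knowledge request).
toy = np.array([[5, 0], [4, 1], [5, 0], [0, 5], [1, 4], [5, 0]])
print(round(fleiss_kappa(toy), 3))
```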