Best-of-n jailbreaking
23 Pith papers cite this work.
citing papers explorer
-
Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
Benign fine-tuning on audio data breaks safety alignment in Audio LLMs, raising jailbreak success rates to as much as 87%; the dominant risk axis depends on model architecture and embedding proximity to harmful content.
-
DecompRL: Solving Harder Problems by Learning Modular Code Generation
DecompRL is an RL method that learns modular code decomposition for LLMs, enabling exponential candidate generation via recombination to solve harder coding problems with lower GPU cost.
-
Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation
IRSL applies IRT to reduce scaling law estimation from O(M×N) to O(M+N) parameters, enabling reliable estimates with only 50 questions per benchmark after calibration and generalizable ability scores across related benchmarks.
-
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
-
Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs
IHO is a new black-box jailbreak attack for LLMs that is adaptive, efficient, transferable across models and behaviors, and effective even against layered defenses without modification.
-
Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling
Babel is an efficient black-box jailbreaking framework that formalizes sparse safety attention heads via a mathematical obfuscation model and uses iterative distribution refinement to achieve higher attack success rates on models like GPT-4o and Claude-3-5-haiku with around 40 queries.
-
Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs
Systematic evaluation of all ordered pairs among twelve jailbreak mutators on harmful prompts reveals mostly destructive interference but some synergistic combinations that raise success rates on three LLMs.
-
The Great Pretender: A Stochasticity Problem in LLM Jailbreak
ASR metrics for LLM jailbreaks are inflated by stochasticity; CAS-eval reveals up to 30pp drops under multi-attempt criteria while CAS-gen recovers the performance loss.
-
Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis
Survival analysis applied to repeated jailbreak attacks on three LLMs shows one model degrades rapidly while the others maintain moderate vulnerability on HarmBench prompts.
-
Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis
Harmful prompts reformulated as coherent mathematical problems bypass LLM safety mechanisms at rates of 46-56%, with success depending on deep reformulation rather than mere notation.
-
Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation
MechaRule localizes sparse agonist neurons via contrastive hierarchical ablation and adaptive group testing to ground rule extraction, recalling 97% of high-effect activations at 2.14% cost while enabling near-total elimination of target behaviors.
-
MultiBreak: A Scalable and Diverse Multi-turn Jailbreak Benchmark for Evaluating LLM Safety
MultiBreak is a large diverse multi-turn jailbreak benchmark that achieves substantially higher attack success rates on LLMs than prior datasets and reveals topic-specific vulnerabilities in multi-turn settings.
-
Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while preserving safety.
-
Latent Instruction Representation Alignment: defending against jailbreaks, backdoors and undesired knowledge in LLMs
LIRA aligns latent instruction representations in LLMs to defend against jailbreaks, backdoors, and undesired knowledge, blocking over 99% of PEZ attacks and achieving optimal WMDP forgetting.
-
GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking
GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on four audio LLMs.
-
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
TPCs allow term-by-term progressive polynomial evaluation on LLM activations for flexible safety monitoring that supports both stronger guardrails and low-cost adaptive cascades.
-
Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem
Formalizes the jailbreak oracle problem for LLMs and introduces Boa, a two-phase breadth-first then depth-first search system to solve it efficiently.
-
Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey
The survey introduces a four-category taxonomy for LALM evaluations and reviews benchmarks across general auditory processing, knowledge reasoning, dialogue, and fairness-safety.
-
Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs
Phonetic perturbations fragment safety-critical tokens in LLMs, suppressing attribution scores while preserving input understanding and causing safety mechanisms to fail despite good comprehension.
-
Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models
Behavioral geometry of model populations enables high-accuracy jailbreak susceptibility prediction and defense transfer with 98% fewer evaluations.
-
SoK: Robustness in Large Language Models against Jailbreak Attacks
The paper taxonomizes jailbreak attacks and defenses for LLMs, introduces the Security Cube multi-dimensional evaluation framework, benchmarks 13 attacks and 5 defenses, and identifies open challenges in LLM robustness.
-
LLM-Safety Evaluations Lack Robustness
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
-
Estimating Tail Risks in Language Model Output Distributions