Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.
Advances in Neural Information Processing Systems , volume=
5 Pith papers cite this work. Polarity classification is still indexing.
years
2026 5representative citing papers
STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.
Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.
citing papers explorer
-
Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success
Jailbreak evaluations must report distributional statistics such as Variant Sensitivity Measure and Union Coverage across parameter variants rather than single best-case attack success rates.
-
STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
STAR-Teaming uses a Strategy-Response Multiplex Network inside a multi-agent framework to organize attack strategies into semantic communities, delivering higher attack success rates on LLMs at lower computational cost than prior methods.
-
A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models
Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.
- REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak
- Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization