Analysis of 67,057 servers across six registries reveals widespread conditions for server hijacking and metadata manipulation in MCP, with a new tool MCPInspect flagging 833 vulnerable servers and 18 with suspicious descriptions.
hub Mixed citations
Jailbreaker: Automated jailbreak across multiple large language model chatbots.arXiv preprint arXiv:2307.08715
Mixed citation behavior. Most common role is background (67%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
Crescendo is a multi-turn escalation jailbreak that achieves high success rates on GPT-4, Gemini, Llama, and Claude by building on the model's prior responses, with an automated tool outperforming prior attacks on AdvBench.
SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task completion.
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
Only 39% of LLM safety benchmark repositories run without modification, 6% include ethical warnings, and adoption tracks author prominence and runnability rather than code quality metrics.
StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.
Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.
Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.
GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.
A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
citing papers explorer
-
A First Look at the Security Issues in the Model Context Protocol Ecosystem
Analysis of 67,057 servers across six registries reveals widespread conditions for server hijacking and metadata manipulation in MCP, with a new tool MCPInspect flagging 833 vulnerable servers and 18 with suspicious descriptions.
-
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
-
JailWAM: Jailbreaking World Action Models in Robot Control
JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
-
RACC: Representation-Aware Coverage Criteria for LLM Safety Testing
RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
-
Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
Crescendo is a multi-turn escalation jailbreak that achieves high success rates on GPT-4, Gemini, Llama, and Claude by building on the model's prior responses, with an automated tool outperforming prior attacks on AdvBench.
-
SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills
SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task completion.
-
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
-
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
-
Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks
Only 39% of LLM safety benchmark repositories run without modification, 6% include ethical warnings, and adoption tracks author prominence and runnability rather than code quality metrics.
-
A StrongREJECT for Empty Jailbreaks
StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.
-
Whispers in the Machine: Confidentiality in Agentic Systems
Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.
-
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
-
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.
-
Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.
-
GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.
-
Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference
A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.
-
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.