hub Canonical reference

Jail- breaker: Automated jailbreak across multiple large language model chatbots

· 2023 · arXiv 2307.08715

Canonical reference. 71% of citing Pith papers cite this work as background.

19 Pith papers citing it

Background 71% of classified citations

read on arXiv browse 19 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 baseline 1 method 1

citation-polarity summary

background 5 baseline 1 unclear 1

representative citing papers

A First Look at the Security Issues in the Model Context Protocol Ecosystem

cs.CR · 2025-10-18 · conditional · novelty 8.0

Analysis of 67,057 servers across six registries reveals widespread conditions for server hijacking and metadata manipulation in MCP, with a new tool MCPInspect flagging 833 vulnerable servers and 18 with suspicious descriptions.

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.

When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

cs.CR · 2026-05-04 · unverdicted · novelty 7.0

A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

cs.CR · 2026-04-11 · unverdicted · novelty 7.0

HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

JailWAM: Jailbreaking World Action Models in Robot Control

cs.RO · 2026-04-07 · unverdicted · novelty 7.0

JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.

RACC: Representation-Aware Coverage Criteria for LLM Safety Testing

cs.SE · 2026-02-02 · unverdicted · novelty 7.0

RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.

Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

cs.CR · 2024-04-02 · conditional · novelty 7.0

Crescendo is a multi-turn escalation jailbreak that achieves high success rates on GPT-4, Gemini, Llama, and Claude by building on the model's prior responses, with an automated tool outperforming prior attacks on AdvBench.

SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills

cs.CR · 2026-05-07 · unverdicted · novelty 6.0

SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task completion.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.

CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks

cs.CR · 2026-04-05 · unverdicted · novelty 6.0

CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.

Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks

cs.CR · 2026-03-03 · conditional · novelty 6.0

Only 39% of LLM safety benchmark repositories run without modification, 6% include ethical warnings, and adoption tracks author prominence and runnability rather than code quality metrics.

A StrongREJECT for Empty Jailbreaks

cs.LG · 2024-02-15 · conditional · novelty 6.0

StrongREJECT provides a standardized benchmark and evaluator for jailbreak attacks that aligns better with human judgments than prior methods and reveals that successful jailbreaks often reduce model capabilities.

Whispers in the Machine: Confidentiality in Agentic Systems

cs.CR · 2024-02-10 · unverdicted · novelty 6.0

Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

cs.AI · 2023-09-19 · unverdicted · novelty 6.0

GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.

"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models

cs.CR · 2023-08-07 · unverdicted · novelty 6.0

Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.

Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

cs.LG · 2026-04-17 · unverdicted · novelty 5.0

Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

cs.CL · 2025-08-28 · unverdicted · novelty 5.0

GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.

Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference

cs.CR · 2026-04-14 · unverdicted · novelty 4.0

A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

cs.CR · 2024-07-05 · accept · novelty 4.0

A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

citing papers explorer

Showing 14 of 14 citing papers after filters.

LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 6
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents cs.CR · 2026-05-04 · unverdicted · none · ref 55
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion cs.CR · 2026-04-11 · unverdicted · none · ref 13
HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.
JailWAM: Jailbreaking World Action Models in Robot Control cs.RO · 2026-04-07 · unverdicted · none · ref 4
JailWAM is the first dedicated jailbreak framework for World Action Models, achieving 84.2% attack success rate on LingBot-VA in RoboTwin simulation and enabling safety evaluation of robotic AI.
RACC: Representation-Aware Coverage Criteria for LLM Safety Testing cs.SE · 2026-02-02 · unverdicted · none · ref 18
RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
SkillScope: Toward Fine-Grained Least-Privilege Enforcement for Agent Skills cs.CR · 2026-05-07 · unverdicted · none · ref 11
SkillScope detects over-privileged LLM agent skills with 94.53% F1 score via graph analysis and replay validation, finding 7,039 problematic skills in the wild and reducing violations by 88.56% while preserving task completion.
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs cs.LG · 2026-04-10 · unverdicted · none · ref 16
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks cs.CR · 2026-04-05 · unverdicted · none · ref 5
CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
Whispers in the Machine: Confidentiality in Agentic Systems cs.CR · 2024-02-10 · unverdicted · none · ref 51
Systematic testing of ten LLM agents across 20 tool scenarios and 14 attacks finds universal vulnerability to prompt injection enabling data exfiltration, with tooling amplifying leakage.
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts cs.AI · 2023-09-19 · unverdicted · none · ref 14
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models cs.CR · 2023-08-07 · unverdicted · none · ref 26
Real-world jailbreak prompts collected from the wild achieve up to 0.95 attack success rates against major LLMs including GPT-4, with some persisting for over 240 days.
Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs cs.LG · 2026-04-17 · unverdicted · none · ref 10
Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.
GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs cs.CL · 2025-08-28 · unverdicted · none · ref 26
GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.
Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference cs.CR · 2026-04-14 · unverdicted · none · ref 13
A modified Llama 3 model using fully homomorphic encryption achieves up to 98% text generation accuracy and 80 tokens per second at 237 ms latency on an i9 CPU.

Jail- breaker: Automated jailbreak across multiple large language model chatbots

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer