DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers. arXiv preprint arXiv:2402.16914

Li, X · 2024 · arXiv 2402.16914

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method 1

citation-polarity summary

background 1

representative citing papers

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

cs.CR · 2026-04-11 · unverdicted · novelty 7.0

HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

cs.CR · 2026-04-01 · unverdicted · novelty 7.0

SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26x less latency.

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

Babel is an efficient black-box jailbreaking framework that formalizes sparse safety attention heads via a mathematical obfuscation model and uses iterative distribution refinement to achieve higher attack success rates on models like GPT-4o and Claude-3-5-haiku with around 40 queries.

Benchmarking Misuse Mitigation Against Covert Adversaries

cs.CR · 2025-06-06 · unverdicted · novelty 6.0

Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.

Adversarial Reframing: A Framework for Targeted Generation in Language Models

cs.CR · 2026-05-20 · unverdicted · novelty 5.0

THREAT uses coordinated LLMs in an iterative optimization loop to generate jailbreak prompts that achieve higher success rates and lower detection rates than previous methods across tested models and datasets.

ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs

cs.CR · 2025-11-04 · unverdicted · novelty 5.0

ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

cs.CL · 2025-08-28 · unverdicted · novelty 5.0

GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

cs.CR · 2024-07-05 · accept · novelty 4.0

A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

citing papers explorer

Showing 8 of 8 citing papers.

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

cs.CR · 2026-04-11 · unverdicted · none · ref 27

HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.

SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

cs.CR · 2026-04-01 · unverdicted · none · ref 14

SelfGrader detects LLM jailbreaks by interpreting logit distributions on numerical tokens with a dual maliciousness-benignness score, cutting attack success rates up to 22.66% while using up to 173x less memory and 26x less latency.

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

cs.CR · 2026-05-18 · unverdicted · none · ref 6

Babel is an efficient black-box jailbreaking framework that formalizes sparse safety attention heads via a mathematical obfuscation model and uses iterative distribution refinement to achieve higher attack success rates on models like GPT-4o and Claude-3-5-haiku with around 40 queries.

Benchmarking Misuse Mitigation Against Covert Adversaries

cs.CR · 2025-06-06 · unverdicted · none · ref 20

Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.

Adversarial Reframing: A Framework for Targeted Generation in Language Models

cs.CR · 2026-05-20 · unverdicted · none · ref 29

THREAT uses coordinated LLMs in an iterative optimization loop to generate jailbreak prompts that achieve higher success rates and lower detection rates than previous methods across tested models and datasets.

ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs

cs.CR · 2025-11-04 · unverdicted · none · ref 26

ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

cs.CL · 2025-08-28 · unverdicted · none · ref 32

GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

cs.CR · 2024-07-05 · accept · none · ref 51

A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
