DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers. arXiv preprint arXiv:2402.16914

8 Pith papers cite this work. Polarity classification is still indexing.

citation summary

fields: cs.CR 7 · cs.CL 1

roles: method 1

polarities: background 1

representative citing papers

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

cs.CR · 2026-05-18 · unverdicted · novelty 6.0

Babel is an efficient black-box jailbreaking framework that formalizes sparse safety attention heads via a mathematical obfuscation model and uses iterative distribution refinement to achieve higher attack success rates on models like GPT-4o and Claude-3-5-haiku with around 40 queries.

Benchmarking Misuse Mitigation Against Covert Adversaries

cs.CR · 2025-06-06 · unverdicted · novelty 6.0

Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.
