Derail yourself: Multi-turn LLM jailbreak attack through self-discovered clues

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · baseline: 1

citation-polarity summary

fields

cs.CR: 4 · cs.CL: 2

years

2026: 4 · 2025: 2

representative citing papers

Activation-Guided Local Editing for Jailbreaking Attacks

cs.CR · 2025-08-01 · unverdicted · novelty 5.0

AGILE is a two-stage jailbreak attack that combines scenario-based rephrasing with activation-guided local editing to reach state-of-the-art attack success rates and strong black-box transferability.

citing papers explorer

Showing 5 of 5 citing papers after filters.