A multi-turn intention-deception jailbreak achieves high success on GPT-5 and Claude models while exposing para-jailbreaking where models leak harmful information without direct refusal.
Derail yourself: Multi-turn llm jailbreak attack through self- discovered clues
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
AGILE is a two-stage jailbreak attack that combines scenario-based rephrasing with activation-guided local editing to reach state-of-the-art attack success rates and strong black-box transferability.