LLMs Know Their Vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts. arXiv preprint arXiv:2410.10700

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · baseline: 1

citation-polarity summary

fields

cs.CR: 4 · cs.CL: 2

years

2026: 4 · 2025: 2

representative citing papers

Activation-Guided Local Editing for Jailbreaking Attacks

cs.CR · 2025-08-01 · unverdicted · novelty 5.0

AGILE is a two-stage jailbreak attack that combines scenario-based rephrasing with activation-guided local editing to reach state-of-the-art attack success rates and strong black-box transferability.
