Goal reframing prompts trigger 38-40% exploitation rates on Claude Sonnet 4 while nine other dimensions show no detectable effect (upper 95% CI <7%) across 10,000 trials in Docker sandboxes.
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.CR (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
Mapping the Exploitation Surface: A 10,000-Trial Taxonomy of What Makes LLM Agents Exploit Vulnerabilities