Defenses against prompt attacks learn surface heuristics,

· 2026 · arXiv 2601.07185

2 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

cs.CR · 2026-06-13 · unverdicted · novelty 7.0

AutoDojo adaptively optimizes IPI attacks to bypass defenses, recovering substantial ASR on action-open tasks where static attacks fail.

cs.AI · 2026-06-27 · unverdicted · novelty 6.0

Agent safety cannot be achieved via model refusal training and instead requires external least-privilege enforcement evaluated as action alignment.


AutoDojo: Adaptive Black-Box Attacks Reveal the Limits of IPI Defenses and Task-Specification Effects in LLM Agents · cs.CR · 2026-06-13 · unverdicted · none · ref 47
Agent Safety Is Action Alignment · cs.AI · 2026-06-27 · unverdicted · none · ref 21