The Autonomy Tax: Defense Training Breaks LLM Agents

Shawn Li; Yue Zhao

arXiv: 2603.19423 · v2 · pith:EMHILWABnew · submitted 2026-03-19 · 💻 cs.CR · cs.AI · cs.LG

keywords: models, agent, agents, defense, tasks, attacks, baselines

Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations or retrieved content. We reveal a fundamental capability-alignment paradox: defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents. Agent incompetence bias manifests as immediate tool execution breakdown, with models refusing or generating invalid actions on benign tasks before observing any external content. Cascade amplification bias causes early failures to propagate through retry loops, pushing defended models to timeout on 99% of tasks compared to 13% for baselines. Trigger bias leads to paradoxical security degradation where defended models perform worse than undefended baselines while straightforward attacks bypass defenses at high rates. Root cause analysis reveals these biases stem from shortcut learning: models overfit to surface attack patterns rather than semantic threat understanding, evidenced by extreme variance in defense effectiveness across attack categories. Our findings demonstrate that current defense paradigms optimize for single-turn refusal benchmarks while rendering multi-step agents fundamentally unreliable, necessitating new approaches that preserve tool execution competence under adversarial conditions.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Counterfactual Trace Auditing of LLM Agent Skills
cs.AI · 2026-05 · unverdicted · novelty 7.0

CTA framework detects 522 skill influence patterns in LLM agent traces across 49 tasks where average pass rate shifts only +0.3%, exposing evaluation gaps in behavioral effects like template copying and excess planning.

Counterfactual Trace Auditing of LLM Agent Skills
cs.AI · 2026-05 · unverdicted · novelty 7.0

Counterfactual Trace Auditing detects 522 behavioral change patterns from skills on 49 tasks where pass rates shift only 0.3 points on average.

Agent Safety Is Action Alignment
cs.AI · 2026-06 · unverdicted · novelty 6.0

Agent safety cannot be achieved via model refusal training and instead requires external least-privilege enforcement evaluated as action alignment.