The ClawSafety benchmark finds 40-75% attack success rates on frontier LLMs used as agents; skill-file injections are the most effective vector, and safety depends on both the model and the full agent framework.
Depending on the test case, the agent encounters adversarial content through exactly one of the three injection vectors
1 Pith paper cites this work (field: cs.AI; year: 2026; verdict: UNVERDICTED). Polarity classification is still indexing.
Representative citing paper:
ClawSafety: "Safe" LLMs, Unsafe Agents