Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
AgentHarm: A benchmark for measuring harmfulness of LLM agents
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.CR 2years
2026 2verdicts
UNVERDICTED 2roles
background 1polarities
background 1representative citing papers
Proposes a trust schema including verification levels and a biconditional correctness criterion to verify skills in human-in-the-loop agent runtimes, reducing the need for constant oversight.
citing papers explorer
-
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems
Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
-
Skills as Verifiable Artifacts: A Trust Schema and a Biconditional Correctness Criterion for Human-in-the-Loop Agent Runtimes
Proposes a trust schema including verification levels and a biconditional correctness criterion to verify skills in human-in-the-loop agent runtimes, reducing the need for constant oversight.