arXiv: 2605.22643 · 2 revisions
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety