Review history

arxiv: 2605.07926 · 2 revisions

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

2026-05-21 UNVERDICTED LOW v0.9.0 novelty 7.0

33390 ms 5793 in 1136 out 2026-05-21T07:55:35.726437+00:00

2026-05-11 UNVERDICTED LOW v0.9.0 novelty 7.0

33150 ms 5562 in 1102 out 2026-05-11T03:28:00.724118+00:00