Review history

arxiv: 2605.07926 · 2 revisions

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

  1. 2026-05-21 UNVERDICTED LOW v0.9.0 novelty 7.0
    33390 ms 5793 in 1136 out 2026-05-21T07:55:35.726437+00:00
  2. 2026-05-11 UNVERDICTED LOW v0.9.0 novelty 7.0
    33150 ms 5562 in 1102 out 2026-05-11T03:28:00.724118+00:00