arxiv: 2605.07926 · 2 revisions
AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents