{"paper":{"title":"AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"LLM agents handle short tool sequences but lose substantial accuracy when required to track deep chains of dependencies across novel procedures.","cross_cats":[],"primary_cat":"cs.AI","authors_text":"Dongyu Ru, Jingwen Xv, Lin Qiu, Xiaohua Wang, Xiaoqing Zheng, Xiaoyu Li, Xuezhi Cao, Xunliang Cai, Yiyang Li, Zhengkang Guo","submitted_at":"2026-05-08T15:59:27Z","abstract_excerpt":"As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints. Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results, and submit a"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from 98.3% success at difficulty-5 to 80.0% at difficulty-25, while the best model drops from 90.0% to 60.0%.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the escape-room tasks with explicit DAG constraints and incremental state revelation accurately capture the core challenges of out-of-domain tool-grounded reasoning without introducing benchmark-specific artifacts or overly artificial constraints.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"AgentEscapeBench shows LLM agents' success rates drop from 90% to 60% as tool-dependency depth increases from 5 to 25 steps, while humans drop only from 98% to 80%.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"LLM agents handle short tool sequences but lose substantial accuracy when required to track deep chains of dependencies across novel procedures.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"ff43d7e8accb9131a0e7e9f3cc68309f3b9957f1d5d0312ae9631a92ed0e892d"},"source":{"id":"2605.07926","kind":"arxiv","version":2},"verdict":{"id":"442f6878-78d8-4d65-82e2-bb497f852b74","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-11T03:28:00.724118Z","strongest_claim":"Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from 98.3% success at difficulty-5 to 80.0% at difficulty-25, while the best model drops from 90.0% to 60.0%.","one_line_summary":"AgentEscapeBench shows LLM agents' success rates drop from 90% to 60% as tool-dependency depth increases from 5 to 25 steps, while humans drop only from 98% to 80%.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the escape-room tasks with explicit DAG constraints and incremental state revelation accurately capture the core challenges of out-of-domain tool-grounded reasoning without introducing benchmark-specific artifacts or overly artificial constraints.","pith_extraction_headline":"LLM agents handle short tool sequences but lose substantial accuracy when required to track deep chains of dependencies across novel procedures."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.07926/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"claim_evidence","ran_at":"2026-05-20T10:02:13.538218Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-20T04:46:14.043446Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T15:31:18.284052Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T11:23:49.029118Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"ce6a8a5f131eeed6dd64df3e7607604f3ce61d7b9ee8ffaa994b1ff9aa314519"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"db7545589050c9d2a56954e6db12d4d5ea3a2ac4aa29e0852203ff552838724b"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}