AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

Dongyu Ru; Jingwen Xv; Lin Qiu; Xiaohua Wang; Xiaoqing Zheng; Xiaoyu Li; Xuezhi Cao; Xunliang Cai; Yiyang Li; Zhengkang Guo

arxiv: 2605.07926 · v2 · pith:A4YI2J37new · submitted 2026-05-08 · 💻 cs.AI


Pith reviewed 2026-05-21 07:55 UTC · model grok-4.3

classification 💻 cs.AI

keywords: LLM agents · tool use · benchmark · long-range dependencies · state tracking · reasoning · out-of-domain


The pith

A new escape-room benchmark reveals LLM agents lose ground when tool dependencies span many steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AgentEscapeBench to test whether LLM agents can sustain tool-grounded reasoning over long-range dependencies that lie outside familiar workflows. Tasks are built as directed acyclic graphs of tools and items, with state revealed only incrementally so agents must track hidden information, execute functions, propagate results, and reach a verifiable answer. Experiments with sixteen agents show success rates fall markedly as dependency depth rises, while human performance declines more modestly. The work aims to expose specific failure modes such as breakdowns in state tracking and clue adherence that current agents exhibit beyond short-range interactions.

Core claim

Current agents handle local tool use but struggle with deep contextual dependencies: across 270 tasks in five difficulty tiers, the best model's success rate falls from 90.0% at the shallowest tier (difficulty-5) to 60.0% at the deepest (difficulty-25).

What carries the argument

Directed acyclic dependency graph over tools and items with incremental state revelation, forcing agents to maintain and propagate information across multiple steps.
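To make the mechanism concrete, here is a minimal Python sketch of a task whose nodes become available only after their predecessors are resolved. The `Node` and `Room` classes, the node names, and the toy functions are illustrative assumptions, not the benchmark's actual interface.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    """One tool or item in a hypothetical escape-room task."""
    node_id: str
    func: Callable[..., object]       # function executed when the node is invoked
    inputs: Tuple[str, ...] = ()      # predecessor node ids (empty for source nodes)


@dataclass
class Room:
    """Toy environment: state is revealed incrementally as nodes are resolved."""
    nodes: Dict[str, Node]
    resolved: Dict[str, object] = field(default_factory=dict)

    def visible(self) -> List[str]:
        # A node is available only once every predecessor has been resolved.
        return sorted(
            nid for nid, node in self.nodes.items()
            if nid not in self.resolved
            and all(p in self.resolved for p in node.inputs)
        )

    def invoke(self, node_id: str, *args: object) -> object:
        if node_id not in self.visible():
            raise PermissionError(f"{node_id} is not available yet")
        node = self.nodes[node_id]
        # Arguments must be the propagated outputs of the predecessors.
        expected = tuple(self.resolved[p] for p in node.inputs)
        if args != expected:
            raise ValueError(f"{node_id}: arguments do not match predecessor outputs")
        self.resolved[node_id] = node.func(*args)
        return self.resolved[node_id]


# Hypothetical four-node task: two sources feed a lock, which feeds the exit.
room = Room({
    "note": Node("note", lambda: 12),
    "dial": Node("dial", lambda: 18),
    "lock": Node("lock", lambda a, b: math.gcd(a, b), inputs=("note", "dial")),
    "exit": Node("exit", lambda g: g * 7, inputs=("lock",)),
})
first_visible = room.visible()          # ['dial', 'note']
a = room.invoke("note")
b = room.invoke("dial")
g = room.invoke("lock", a, b)
answer = room.invoke("exit", g)         # 42
```

In this toy version the agent fails as soon as it calls a node too early or passes a value that was not produced upstream, which is the kind of long-range bookkeeping the benchmark is described as stressing.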

If this is right

Model failures concentrate in long-range state tracking, clue adherence, and intermediate-result propagation.
Agents require improved mechanisms for maintaining coherence across extended sequences of tool calls.
The benchmark supplies a fully automated, scalable testbed for measuring progress toward more robust agent reasoning.
Training efforts should target adaptation to novel dependency structures rather than only familiar short-range patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dependency-tracking weakness may limit performance in other agent domains that involve chained external actions.
Incorporating longer synthetic dependency chains into training data could narrow the observed gap with human performance.
Real-world applications such as automated software debugging or multi-step scientific workflows may face similar constraints.

Load-bearing premise

The escape-room tasks and their dependency graphs accurately reflect the structure of real out-of-domain tool-grounded reasoning problems.

What would settle it

Measure whether agents that succeed on the benchmark also succeed at comparable rates on actual multi-step external-tool workflows whose dependency depth has been independently quantified.

Figures

Figures reproduced from arXiv: 2605.07926 by Dongyu Ru, Jingwen Xv, Lin Qiu, Xiaohua Wang, Xiaoqing Zheng, Xiaoyu Li, Xuezhi Cao, Xunliang Cai, Yiyang Li, Zhengkang Guo.

**Figure 1.** Conceptual illustration of AgentEscapeBench. The agent is placed in a themed escape room populated with unfamiliar tools and hidden items. It must explore the environment, invoke tools with correct parameters derived from narrative clues, and propagate intermediate outputs through a multi-step dependency chain to unlock the final exit.

**Figure 2.** Six-stage data construction pipeline of AgentEscapeBench. Starting from a curated template library (Stage 1), a reverse-generation algorithm assembles a DAG skeleton (Stage 2), annotates source ports (Stage 3), instantiates concrete values via an LLM (Stage 4), executes the DAG forward to compute ground-truth outputs (Stage 5), and generates themed narratives with structural validation (Stage 6).

**Figure 3.** Behavioural metric trends across difficulty levels. (a) Source-node convergence speed (lower is better): the average number of attempts to correctly resolve a source node. (b) Premature invocation rate (lower is better): the fraction of non-source invocations made before all predecessors are resolved. (c) Clue adherence rate (higher is better): the fraction of downstream invocations whose arguments trace b…

**Figure 4.** Tool calls vs. success rate at difficulty-10. Upper-left is better (fewer calls, higher SR).

**Figure 5.** Interaction trace (page 1/3): system prompt, environment initialization, and first investiga…

**Figure 6.** Interaction trace (page 2/3): remaining investigations, dependency reasoning, and tool…

**Figure 7.** Interaction trace (page 3/3): final computation steps and successful answer submission.
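Two of the behavioural metrics in the Figure 3 caption are defined precisely enough to sketch in code. The following Python is a rough illustration only: the trace format (a list of `(node_id, succeeded)` pairs), the `preds` mapping, and the decision to ignore source nodes that are never solved are all assumptions, not details taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical trace format: (node_id, call_succeeded), in call order.
Trace = Sequence[Tuple[str, bool]]


def premature_invocation_rate(preds: Dict[str, List[str]], trace: Trace) -> float:
    """Fraction of non-source invocations made before all predecessors are resolved."""
    resolved = set()
    non_source = premature = 0
    for node_id, ok in trace:
        if preds[node_id]:                      # non-source node
            non_source += 1
            if not all(p in resolved for p in preds[node_id]):
                premature += 1
        if ok:
            resolved.add(node_id)
    return premature / non_source if non_source else 0.0


def source_convergence_speed(preds: Dict[str, List[str]], trace: Trace) -> float:
    """Average number of attempts needed to resolve a source node.

    Assumption: source nodes that are never solved are left out of the average.
    """
    attempts: Dict[str, int] = {}
    solved = set()
    for node_id, ok in trace:
        if preds[node_id] or node_id in solved:
            continue                            # skip non-sources and repeats
        attempts[node_id] = attempts.get(node_id, 0) + 1
        if ok:
            solved.add(node_id)
    counts = [attempts[n] for n in solved]
    return sum(counts) / len(counts) if counts else float("nan")


# Toy example: "c" depends on sources "a" and "b".
preds = {"a": [], "b": [], "c": ["a", "b"]}
trace = [("a", False), ("a", True), ("c", False), ("b", True), ("c", True)]

pir = premature_invocation_rate(preds, trace)   # 0.5: the first call to "c" came before "b"
scs = source_convergence_speed(preds, trace)    # 1.5: "a" took 2 attempts, "b" took 1
```

The clue adherence rate is not sketched because its definition is cut off in the caption above.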

Original abstract

As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints. Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results, and submit a deterministically verifiable final answer. AgentEscapeBench includes 270 instances across five difficulty tiers and supports fully automated evaluation. Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from 98.3% success at difficulty-5 to 80.0% at difficulty-25, while the best model drops from 90.0% to 60.0%. Trajectory analysis attributes model failures mainly to breakdowns in long-range state tracking, clue adherence, and intermediate-result propagation. These findings suggest that current agents can often handle local tool use but still struggle with deep contextual dependencies. We hope AgentEscapeBench can serve as a diagnostic testbed for measuring current agent capabilities and informing future training efforts toward more robust general-purpose reasoning, action, and adaptation.
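The "deterministically verifiable final answer" can be pictured as the output of running the dependency graph forward in topological order and comparing the agent's submission to the result by exact match. The sketch below uses Python's standard `graphlib`. The node names, toy functions, and exact-match `verify` helper are hypothetical, and the real pipeline may differ.

```python
import math
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def forward_execute(preds: Dict[str, List[str]],
                    funcs: Dict[str, Callable[..., object]]) -> Dict[str, object]:
    """Run every node once, predecessors first, and collect all outputs."""
    outputs: Dict[str, object] = {}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    for node in TopologicalSorter(preds).static_order():
        args = [outputs[p] for p in preds[node]]
        outputs[node] = funcs[node](*args)
    return outputs


def verify(submitted: object, expected: object) -> bool:
    """Deterministic check: exact match against the ground-truth output."""
    return submitted == expected


# Hypothetical task: two clue values are combined, then formatted as the exit code.
preds = {"clue_a": [], "clue_b": [], "combine": ["clue_a", "clue_b"], "exit": ["combine"]}
funcs = {
    "clue_a": lambda: 21,
    "clue_b": lambda: 14,
    "combine": lambda a, b: math.gcd(a, b),
    "exit": lambda g: f"{g:04d}",
}
truth = forward_execute(preds, funcs)["exit"]   # "0007"
```

Because every node is a pure function of its predecessors in this sketch, the ground truth is reproducible, and an answer such as `"7"` would be rejected even though it is numerically equivalent.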

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AgentEscapeBench, an escape-room-style benchmark with 270 instances across five difficulty tiers (difficulty-5 to difficulty-25). Tasks are defined via directed acyclic dependency graphs (DAGs) over tools and items, requiring agents to invoke external functions, track incrementally revealed hidden state, propagate intermediate results, and produce a verifiable final answer. Experiments with 16 LLM agents and human participants show sharp performance declines with increasing dependency depth: humans drop from 98.3% success at difficulty-5 to 80.0% at difficulty-25, while the best model drops from 90.0% to 60.0%. Trajectory analysis attributes model failures mainly to breakdowns in long-range state tracking, clue adherence, and intermediate-result propagation. The benchmark is positioned as a diagnostic testbed for out-of-domain tool-grounded reasoning in LLM agents.

Significance. If the benchmark successfully isolates dependency depth without confounds from task length or state size, the work offers a concrete, automated evaluation framework with human baselines that can diagnose specific limitations in current agents' long-range reasoning and inform training for more robust tool use. The direct measurement against human performance and the focus on falsifiable, verifiable outcomes are strengths that could make this a useful addition to agent evaluation suites.

major comments (2)

Task design description (abstract and benchmark construction section): The manuscript does not report whether the number of required tool calls, state variables, or total sequence length is held constant across the five difficulty tiers. If deeper tiers systematically increase step count or state cardinality, the observed drops (humans 98.3%→80%, best model 90%→60%) could arise from cumulative error accumulation rather than depth of contextual dependencies per se.
Experimental setup (abstract and results section): Performance drops and failure modes are reported from experiments with 16 agents and humans, but no details are provided on task construction procedures, data exclusion criteria, or statistical controls. This makes full verification of the results difficult and weakens the ability to attribute failures specifically to long-range dependencies.
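The cumulative-error concern in the first comment can be illustrated with a small back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that a difficulty-k task requires k steps and that each step succeeds independently with the same probability. Neither assumption is stated in the paper, and `implied_success` is a hypothetical helper.

```python
def implied_success(p_shallow: float, steps_shallow: int, steps_deep: int) -> float:
    """Success rate at the deeper tier if errors only accumulate per step.

    Assumes independent steps with a constant per-step success probability,
    inferred from the observed success rate at the shallow tier.
    """
    per_step = p_shallow ** (1.0 / steps_shallow)
    return per_step ** steps_deep


# Reported difficulty-5 success rates, extrapolated to 25 steps.
model_pred = implied_success(0.90, 5, 25)    # about 0.59 (reported: 0.60)
human_pred = implied_success(0.983, 5, 25)   # about 0.92 (reported: 0.80)
```

Under these toy assumptions a constant per-step error rate would by itself take a 90% success rate at 5 steps to roughly 59% at 25 steps, which is why per-tier statistics on step counts matter for attributing the drop to dependency depth.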

minor comments (2)

Clarify the exact definition of 'difficulty-5' to 'difficulty-25' tiers, including how dependency depth is quantified in the DAGs.
Add a table or figure summarizing the distribution of tool calls and state variables per tier to address potential confounds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment point by point below and outline the revisions we intend to make.


Referee: Task design description (abstract and benchmark construction section): The manuscript does not report whether the number of required tool calls, state variables, or total sequence length is held constant across the five difficulty tiers. If deeper tiers systematically increase step count or state cardinality, the observed drops (humans 98.3%→80%, best model 90%→60%) could arise from cumulative error accumulation rather than depth of contextual dependencies per se.

Authors: We appreciate the referee's concern about potential confounds in the benchmark design. The five difficulty tiers are constructed by varying the maximum depth of the directed acyclic dependency graphs, while maintaining similar numbers of tools and items across tiers through controlled graph generation parameters. This design aims to isolate the impact of dependency depth. However, to make this explicit and allow for better verification, we will add a detailed analysis in the benchmark construction section, including statistics on the number of required tool calls, state variables, and sequence lengths for each tier. We will also discuss how the design mitigates cumulative error accumulation effects.

Revision: yes

Referee: Experimental setup (abstract and results section): Performance drops and failure modes are reported from experiments with 16 agents and humans, but no details are provided on task construction procedures, data exclusion criteria, or statistical controls. This makes full verification of the results difficult and weakens the ability to attribute failures specifically to long-range dependencies.

Authors: We acknowledge that the current manuscript lacks sufficient detail on the experimental procedures. In the revised manuscript, we will expand the relevant sections to describe the task construction procedures in full, including the algorithm used to generate the DAGs for each difficulty tier. We confirm that all generated tasks were included without exclusion, as they all satisfied the solvability and verifiability criteria. Additionally, we will provide details on the statistical controls, such as the number of trials per agent, how human participants were recruited and instructed, and the methods for computing success rates and analyzing failure modes. This will facilitate full verification and strengthen the attribution to long-range dependencies.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements


The paper introduces AgentEscapeBench as an empirical evaluation framework consisting of 270 task instances across difficulty tiers defined by DAG dependency depth. All reported results (human success 98.3% to 80%, best model 90% to 60%) are direct experimental measurements against human baselines and automated verification, with no derivations, equations, fitted parameters, or predictions that reduce to inputs by construction. Trajectory analysis attributes failures to observed behaviors without invoking self-citations as load-bearing uniqueness theorems or ansatzes. The design is self-contained and externally falsifiable via the released benchmark, satisfying the criteria for an honest non-finding of circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the constructed tasks measure general tool-grounded reasoning capabilities; no free parameters or invented entities are introduced beyond the benchmark tasks themselves.

axioms (1)

Domain assumption: Tasks require agents to infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints. This is the core premise of the benchmark definition in the abstract.

pith-pipeline@v0.9.0 · 5793 in / 1136 out tokens · 33390 ms · 2026-05-21T07:55:35.726437+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results..."

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "DAG skeleton generation... reverse-generation algorithm... structural constraints—acyclicity, single-use semantics..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 5 internal anchors

[1] Anthropic. Claude Code: Agentic coding in the real world. https://www.anthropic.com/, 2025. Accessed: 2025.

[2] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. τ²-Bench: Evaluating conversational agents in a dual-control environment. arXiv preprint arXiv:2506.07982, 2025.

[3] FIXME. Fixme: Find correct gaia-2 agent benchmark paper. FIXME, 2025.

[4] Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, et al. VitaBench: Benchmarking LLM agents with versatile interactive tasks in real-world applications. arXiv preprint arXiv:2509.26490, 2025.

[5] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.

[6] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2023.

[7] Hengzhi Li, Brendon Jiang, Alexander Naehu, Regan Song, Justin Zhang, Megan Tjandrasuwita, Chanakya Ekbote, Steven-Shine Chen, Adithya Balachandran, Wei Dai, et al. PuzzleWorld: A benchmark for multimodal, open-ended reasoning in puzzlehunts. arXiv preprint arXiv:2506.06211, 2025.

[8] Seungwon Lim, Sungwoong Kim, Jihwan Yu, Sungjae Lee, Jiwan Chung, and Youngjae Yu. VisEscape: A benchmark for evaluating exploration-driven decision-making in virtual escape rooms. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16031–16058, 2025.

[9] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023.

[10] Haotian Luo, Huaisong Zhang, Xuelin Zhang, Haoyu Wang, Zeyu Qin, Wenjie Lu, Guozheng Ma, Haiying He, Yingsha Xie, Qiyang Zhou, et al. UltraHorizon: Benchmarking agent capabilities in ultra long-horizon scenarios. arXiv preprint arXiv:2509.21766, 2025.

[11] MAA. AIME 2025, 2025. URL https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions

[12] Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning, 2025.

[13] Cheng Qian, Peixuan Han, Qinyu Luo, Bingxiang He, Xiusi Chen, Yuji Zhang, Hongyi Du, Jiarui Yao, Xiaocheng Yang, Denghui Zhang, et al. EscapeBench: Pushing language models to think outside the box. arXiv e-prints, pages arXiv–2412, 2024.

[14] Shuofei Qiao, Honghao Gui, Chengfei Lv, Qianghuai Jia, Huajun Chen, and Ningyu Zhang. Making language models better tool learners with execution feedback. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 3550–3568, 2024.

[15] Yuanzhe Shen, Zisu Huang, Zhengyuan Wang, Muzhao Tian, Zhengkang Guo, Chenyang Zhang, Shuaiyu Zhou, Zengjie Hu, Dailin Li, Jingwen Xu, et al. TRIP-Bench: A benchmark for long-horizon interactive agents in real-world scenarios. arXiv preprint arXiv:2602.01675, 2026.

[16] Zinan Tang and Qiyao Sun. Big Escape Benchmark: Evaluating human-like reasoning in language models via real-world escape room challenges. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM2), pages 488–503, 2025.

[17] Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, et al. MCP-Bench: Benchmarking tool-using LLM agents with complex real-world tasks via MCP servers. arXiv preprint arXiv:2508.20453, 2025.

[18] Beichen Zhang, Kun Zhou, Xilin Wei, Xin Zhao, Jing Sha, Shijin Wang, and Ji-Rong Wen. Evaluating and improving tool-augmented computation-intensive math reasoning. Advances in Neural Information Processing Systems, 36:23570–23589, 2023.

A Dataset Construction Details. This appendix provides full algorithmic details for the six-stage data construction pipeli...
[19] Sample final-goal node. A template with at least one input port is selected uniformly at random. A node is instantiated from this template and designated as the win node; the puzzle's success condition is defined as producing the correct output of this node.

[20] Initialise pending queue. All input ports of the final-goal node are enqueued as pending requirements, each represented as a tuple (v, p, τ, r): target node v, target port p, required type τ, and retry count r = 0. Algorithm 1: DAG Skeleton Generation via Reverse Growth. Require: target node count n, template library T. Ensure: DAG skeleton G = (V, E). 1: Select a ra...

[21] Drop-node outputs: if a node's triggering input is of type ITEM/HIDDEN_ITEM (indicating activation by a physical object from a container), its output ports that feed downstream nodes are also flagged, because these values must be materialised before forward execution can propagate through them. The output is a structured source_init map attached to the DAG ...

[22] Missing required parameter: The agent omits one or more mandatory input parameters when invoking a tool or item. Triggered when the server reports missing parameters, an empty parameter set, or a partial parameter submission.

[23] Wrong node type: The agent attempts an operation incompatible with the node's type, such as calling use_item with arguments on a simple Item that should be accessed via investigate.

[24] Repeated solved node: The agent attempts to unlock an item that has already been successfully solved in a previous step, wasting a turn.

[25] Wrong parameter type: The agent provides a parameter value with an incorrect data type (e.g., passing an integer where a hex string is expected).

[26] Node not visible: The agent attempts to invoke a node that does not currently exist in the environment, or mistakenly treats a non-node entity as an invocable node.

[27] Node not exist: The agent references a node ID that does not exist in the current scenario, typically due to hallucinating a node name or confusing template names with instance-specific IDs.

[28] calc_modexp. Wrong format: The agent's tool-call request is structurally malformed, missing essential fields such as node_id. 8. Other: Errors that do not match any of the above classification rules. C Example Solving Trajectory. We present a complete interaction trace of Claude-Opus-4.6 solving a difficulty-10 instance. The agent successfully solves the puzzle in 12 tur...

[29] **text_bidi_sanitize** password for zip (input: `ComplexPassphrase_123!`)

[30] **zip_unzip_file** file_id for graph

[31] ComplexPassphrase_123! **graph_shortest_path** cost (used in calc_gcd and calc_big [...reasoning...] >> text_bidi_sanitize(text="ComplexPassphrase_123!") >> checksum_crc32(hex="4a2b3c4d5e6f708192a3b4...") >> num_base_convert_83f4(from_base="10", s="9182736450192837465192...", to_base="10") Tool Tool text_bidi_sanitize invoked Output: {"text": "ComplexPassphrase_123!"} All param...
