pith. machine review for the scientific record.

arxiv: 2604.10261 · v2 · submitted 2026-04-11 · 💻 cs.AI · cs.CL · cs.LG

Recognition: unknown

The Amazing Agent Race: Strong Tool Users, Weak Navigators

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:46 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords LLM agents · tool use benchmark · navigation failures · DAG puzzles · Wikipedia · agent evaluation · compositional tasks

The pith

Agents fail at navigating pages more than at calling tools in multi-step tasks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Amazing Agent Race benchmark to evaluate LLM agents on Wikipedia-based tasks structured as directed acyclic graphs rather than simple linear sequences. It finds that agents struggle primarily with reaching the correct pages, with navigation errors far outnumbering tool-use mistakes. The best-performing of the three frameworks tested reaches just 37.2 percent accuracy across the 1,400 generated instances. This exposes a gap in existing linear benchmarks, which do not surface navigation weaknesses. Sympathetic readers would see this as evidence that agent development needs to target information-location skills separately from tool execution.

Core claim

The compositional structure of the Amazing Agent Race benchmark shows that LLM agents are weak navigators despite being strong tool users. On 1,400 procedurally generated legs, the top agent framework scores only 37.2 percent accuracy, with navigation failures occurring in 27 to 52 percent of cases while tool-use errors stay under 17 percent. This navigation issue remains hidden in standard linear benchmarks.

What carries the argument

The Amazing Agent Race (AAR) benchmark using directed acyclic graph (DAG) puzzles that require agents to navigate Wikipedia pages, execute fork-merge tool chains, and aggregate results.
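
The page does not reproduce the authors' leg schema, so the sketch below is only an editorial illustration of what a fork-merge DAG leg could look like: the Node class, its field names, and the Nile/Egypt/Sudan clue are assumptions, not the paper's format. It is meant only to make the fork-merge structure concrete.

```python
# Hypothetical sketch: the schema and the example clue are illustrative assumptions,
# not the format used by the AAR authors.
from dataclasses import dataclass, field


@dataclass
class Node:
    page: str                       # Wikipedia article the agent must reach (a "pit stop")
    roadblock: str | None = None    # tool step to perform on that page (a "roadblock"), if any
    depends_on: list[str] = field(default_factory=list)  # parent node ids (fork-merge edges)


# A diamond-shaped leg: one starting clue forks into two branches whose
# intermediate values are merged by a final aggregation step.
leg = {
    "start":  Node(page="Nile"),
    "left":   Node(page="Egypt", roadblock="extract population", depends_on=["start"]),
    "right":  Node(page="Sudan", roadblock="extract population", depends_on=["start"]),
    "finish": Node(page="Nile", roadblock="sum the two populations",
                   depends_on=["left", "right"]),
}
```

Under this reading, finish-line accuracy scores the value produced at the "finish" node, while navigation quality depends on actually reaching the pages named by the intermediate nodes.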

If this is right

  • Navigation errors dominate agent failures at rates of 27 to 52 percent.
  • Tool-use errors remain relatively low, below 17 percent.
  • Agent architecture influences success rates as much as the underlying model scale.
  • Compositional DAG tasks expose navigation blind spots absent from linear benchmarks.
  • The highest accuracy achieved is 37.2 percent on the full set of 1,400 legs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Emphasizing navigation training in agent models could yield larger gains than further tool-calling improvements.
  • Real-world agent deployments might encounter amplified difficulties if they involve less structured information sources than Wikipedia.
  • New benchmarks should routinely include graph structures to better simulate complex problem-solving.
  • Developers could test whether adding explicit path-planning components reduces the observed navigation errors (a naive sketch of what such a component could look like follows this list).
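
On the path-planning point above, nothing here comes from the paper: the snippet is a deliberately naive illustration of what an explicit planning component could mean in this setting, namely a breadth-first search over article links fetched from the public MediaWiki API. The function names, the depth cap, and the example titles are invented, and a real component would need caching, rate limiting, and link-list pagination.

```python
# Naive editorial illustration, not the paper's method: breadth-first search over
# Wikipedia article links as a stand-in for an "explicit path-planning component".
from collections import deque

import requests

API = "https://en.wikipedia.org/w/api.php"


def outgoing_links(title: str) -> list[str]:
    """Fetch article-namespace links from one page (ignores pagination for brevity)."""
    resp = requests.get(API, params={
        "action": "query", "titles": title, "prop": "links",
        "plnamespace": 0, "pllimit": "max", "format": "json",
    }, timeout=10)
    pages = resp.json()["query"]["pages"]
    return [link["title"] for page in pages.values() for link in page.get("links", [])]


def plan_path(start: str, goal: str, max_hops: int = 2) -> list[str] | None:
    """Return a chain of article titles from start to goal within max_hops links, else None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) > max_hops:      # a path of n titles has already taken n - 1 hops
            continue
        for nxt in outgoing_links(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# e.g. plan_path("Nile", "Egypt") -> ["Nile", "Egypt"] if the direct link exists
```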

Load-bearing premise

Procedurally generated DAG legs from Wikipedia seeds with live validation mirror the navigation and reasoning demands of actual multi-step tool-use problems.
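
The premise leans on live validation of procedurally generated legs. The paper's actual checks are not described on this page; the snippet below is only a guess at the simplest such check, confirming via the public MediaWiki API that every required page still resolves at generation time. The function names and the example leg are invented.

```python
# Editorial guess at a minimal "live validation" step, not the paper's pipeline:
# a generated leg is kept only if every required Wikipedia page still resolves.
import requests

API = "https://en.wikipedia.org/w/api.php"


def page_exists(title: str) -> bool:
    """True if the article currently resolves (the API marks missing pages with a 'missing' flag)."""
    resp = requests.get(API, params={
        "action": "query", "titles": title, "redirects": 1, "format": "json",
    }, timeout=10)
    pages = resp.json()["query"]["pages"]
    return all("missing" not in page for page in pages.values())


def validate_leg(required_pages: list[str]) -> bool:
    """Discard legs whose pit-stop pages have been deleted since generation."""
    return all(page_exists(title) for title in required_pages)

# e.g. validate_leg(["Nile", "Egypt", "Sudan"])
```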

What would settle it

Demonstrating that navigation errors do not predominate when agents are tested on human-designed rather than procedurally generated compositional tasks would challenge the central finding.

Figures

Figures reproduced from arXiv: 2604.10261 by Dongseok Lee, Dongyeop Kang, Jaehyung Kim, Vipul Raheja, Zae Myung Kim.

Figure 1
Figure 1: (a) Existing benchmarks are 55 to 100% linear; AAR is 0% linear (all DAGs). Numbers in parentheses show mean steps per instance (abbreviated “s”). (b) Best agent accuracy is 36.6% (aggregated across 1,400 legs). (c) Navigation errors dominate (5% to 52%) while tool-use errors stay below 15%.
Figure 2
Figure 2: An example clue envelope (or a “leg”) as presented to the agent.
Figure 3
Figure 3: Diamond pattern structure.
Figure 4
Figure 4: The eight-step automated pipeline for generating …
Figure 5
Figure 5: (a) Aggregate results across all 1,400 legs (weighted average of Linear and DAG). FA (finish-line accuracy), PVR (navigation), RCR (tool use). Best FA is 36.6% (Claude + Sonnet 4); PVR is consistently the weakest metric. (b) FA degrades monotonically with difficulty (best: −13.5 pp, worst: −19.0 pp). Per-variant breakdown in Appendix M.
Figure 6
Figure 6: DAG structure penalizes navigation, not tool use.
Figure 7
Figure 7: Main results on both benchmark variants …
Figure 8
Figure 8: Per-difficulty breakdown on AAR-Linear. Navigation quality degrades far faster than tool-use competence.
Full results table (Appendix N), Codex CLI + GPT-5.4 rows:

           AAR-Linear (800 legs)    AAR-DAG (600 legs)
Level      FA     PVR    RCR        FA     PVR    RCR
Easy       45.0   88.7   82.8       26.0   76.1   86.8
Medium     39.5   71.5   73.4       30.0   55.5   75.5
Hard       32.5   43.8   57.4       31.9   32.6   64.0
Extreme    31.5   37.1   49.2       35.9   24.3   55.3
All        37.1   60.3   65.7       31.7   43.0   68.0
…
read the original abstract

Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that existing LLM agent tool-use benchmarks are predominantly linear (55-100% simple chains in six analyzed benchmarks) and introduces the Amazing Agent Race (AAR) benchmark with 1,400 DAG-structured legs (800 sequential, 600 compositional) generated procedurally from Wikipedia seeds. Using three metrics to separate navigation, tool-use, and arithmetic failures, evaluation of three agent frameworks on these instances shows the best accuracy at 37.2%, with navigation errors dominating at 27-52% while tool-use errors stay below 17%. The key insight is that agents fail at navigating to the right pages rather than calling tools, a blind spot not captured by linear benchmarks.

Significance. If the empirical findings hold, the work is significant as it provides evidence for a previously under-appreciated limitation in agent navigation within compositional settings. The use of live-API validated procedural generation, multiple diagnostic metrics, and the observation that agent architecture can match model scale in performance (Claude Code matching Codex CLI) are notable strengths. Releasing the benchmark instances advances the field by offering a more challenging evaluation suite that could drive improvements in agent design beyond simple tool calling.

major comments (3)
  1. [§3 (Benchmark Construction)] The procedural generation process for creating fork-merge DAG legs from Wikipedia seeds across four difficulty levels lacks explicit algorithmic details and rules for ensuring natural information flows. This is critical because the central claim of navigation dominance (27-52% errors) depends on these instances not introducing artificial constraints that inflate navigation failures.
  2. [§5 (Experiments)] Details on the three agent frameworks (including Claude Code and Codex CLI) and their specific implementations for handling DAG navigation are insufficient. Without these, the reported 37.2% accuracy and the claim that architecture matters as much as scale cannot be verified or reproduced.
  3. [§5.3 (Error Analysis)] The paper reports error rates without statistical significance tests or variance estimates across the 1,400 instances. This weakens the assertion that navigation errors 'dominate' tool-use errors, as it is unclear if the differences are statistically robust.
minor comments (2)
  1. [Abstract] The abstract mentions 'our analysis of six benchmarks' but does not name them; including the list would improve clarity.
  2. [§4 (Metrics)] The definitions of pit-stop visit rate and roadblock completion rate could be more formally specified to aid understanding of how they isolate navigation vs. tool-use failures (one plausible formalization is sketched after this list).
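
These are not the authors' definitions. The sketch below is one plausible formalization, consistent only with how this page glosses the metrics (PVR diagnoses navigation, RCR diagnoses tool use, FA scores the final answer); the set-based formulation and exact-match scoring are assumptions.

```python
# One plausible formalization of the three metrics, not the authors' definitions.
# Assumes every leg has at least one required page and one required tool step.
def pit_stop_visit_rate(required_pages: set[str], visited_pages: set[str]) -> float:
    """Navigation: fraction of the leg's required pages the agent actually reached."""
    return len(required_pages & visited_pages) / len(required_pages)


def roadblock_completion_rate(required_steps: set[str], completed_steps: set[str]) -> float:
    """Tool use: fraction of the leg's required tool-call steps completed correctly."""
    return len(required_steps & completed_steps) / len(required_steps)


def finish_line_accuracy(predicted: str, gold: str) -> float:
    """End-to-end: full credit only if the final aggregated answer matches exactly."""
    return float(predicted.strip() == gold.strip())
```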

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments and positive assessment of the significance of our work. We address each of the major comments below and have revised the manuscript accordingly to improve clarity, reproducibility, and statistical rigor.

read point-by-point responses
  1. Referee: §3 (Benchmark Construction): The procedural generation process for creating fork-merge DAG legs from Wikipedia seeds across four difficulty levels lacks explicit algorithmic details and rules for ensuring natural information flows. This is critical because the central claim of navigation dominance (27-52% errors) depends on these instances not introducing artificial constraints that inflate navigation failures.

    Authors: We agree that additional details on the procedural generation process would enhance the manuscript. In the revised version, we have expanded §3 with explicit algorithmic pseudocode for generating the fork-merge DAG legs from Wikipedia seeds. We also provide the specific rules used to ensure natural information flows, including how links are selected based on semantic relevance and how difficulty levels are determined and validated via live API checks. These revisions ensure that the instances do not introduce artificial constraints that could inflate navigation errors. revision: yes

  2. Referee: §5 (Experiments): Details on the three agent frameworks (including Claude Code and Codex CLI) and their specific implementations for handling the DAG navigation are insufficient. Without these, the reported 37.2% accuracy and the claim that architecture matters as much as scale cannot be verified or reproduced.

    Authors: We acknowledge the need for more detailed descriptions of the agent frameworks to allow for verification and reproduction. In the revised manuscript, we have augmented §5 with comprehensive details on each of the three agent frameworks, including specific implementations for DAG navigation in Claude Code and Codex CLI. This includes descriptions of their navigation strategies, prompt designs, and how they manage the compositional structure. We have also made the implementation code available through the project repository to support reproducibility. revision: yes

  3. Referee: §5.3 (Error Analysis): The paper reports error rates without statistical significance tests or variance estimates across the 1,400 instances. This weakens the assertion that navigation errors 'dominate' tool-use errors, as it is unclear if the differences are statistically robust.

    Authors: We appreciate this point and have addressed it by adding statistical analysis to §5.3 in the revised manuscript. We now include variance estimates (standard deviations) across the 1,400 instances and perform statistical significance tests (such as chi-squared tests for error type proportions) to confirm that navigation errors significantly dominate tool-use errors. The updated results show p-values well below 0.01, supporting the robustness of our claims (a schematic illustration of such a test, on invented counts, follows this list). revision: yes
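
Nothing below comes from the paper: the counts are invented solely to show the shape of the test the simulated rebuttal describes, and the real analysis may bucket trials differently.

```python
# Invented counts, for illustration only: a chi-squared test of whether the
# navigation-error rate and the tool-use-error rate differ beyond chance.
from scipy.stats import chi2_contingency

#                        [trials with this error, trials without it]
navigation_errors = [520, 880]     # hypothetical: ~37% of 1,400 trials
tool_use_errors   = [170, 1230]    # hypothetical: ~12% of 1,400 trials

chi2, p_value, dof, expected = chi2_contingency([navigation_errors, tool_use_errors])
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.2e}")
# A small p-value would support the claim that the two error rates genuinely differ.
```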

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark evaluation is self-contained

full rationale

The paper presents an empirical benchmark (AAR) consisting of procedurally generated DAG legs from Wikipedia seeds, with live-API validation, and reports direct accuracy measurements (37.2% best-case) plus error breakdowns across 1,400 instances. No equations, fitted parameters, or first-principles derivations appear in the provided text or abstract. Central claims rest on observed performance gaps between navigation and tool-use errors rather than any self-referential reduction or self-citation chain. The evaluation is therefore independent of its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that Wikipedia navigation with tool chains forms a representative testbed for agent capabilities; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Wikipedia pages and links provide a suitable live environment for testing agent navigation and multi-step tool use.
    Benchmark construction and validation rely on this proxy for real-world information tasks.

pith-pipeline@v0.9.0 · 5567 in / 1225 out tokens · 32229 ms · 2026-05-10T15:46:33.002662+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL · 2026-05 · unverdicted · novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

Reference graph

Works this paper leans on

4 extracted references · cited by 1 Pith paper

  1. [1]

     Near-misses (20.5% of all trials): The agent achieves ≥80% intermediate value accuracy but produces the wrong finish-line code. These legs have strong PVR (63.5%) and RCR (71.4%), indicating the agent was on the right track but made a computational error in the final aggregation.

  2. [2]

     Perfect-navigation failures (12.8%): The agent visits ≥90% of required pages but still gets the wrong answer, with RCR at 69.2%. These represent tool-chain or computation errors downstream of successful navigation.

  3. [3]

     Navigation-bypass successes (7.4%): Agents that get the correct answer despite visiting <30% of required pages. These skew toward harder legs (25 hard, 21 extreme), suggesting that experienced tool reasoning can sometimes compensate for navigation failure.

  4. [4]

     I have two plausible interpretations for the Egypt clue, so I’m checking the actual Wikimedia page behind the search hit

     Total failures (8.9% for Codex, 17.6% for mini-swe-agent): Both PVR and RCR below 30%. Mini-swe-agent’s higher rate (2×) reflects its under-exploration strategy. The over-calling paradox: counter-intuitively, incorrect trials use more tool calls on average (21.7) than correct trials (16.5) for Codex + GPT-5.4-mini. Agents that fail tend to over-explore rath...

     (A hedged sketch of this four-category trial taxonomy follows below.)
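
The four excerpts above quote explicit thresholds for a trial taxonomy. The function below is a hedged reconstruction from those quoted numbers; the ordering of the checks and how overlapping cases are resolved are editorial assumptions, not the authors' procedure.

```python
# Hedged reconstruction of the four-way trial taxonomy from the thresholds quoted
# above; check ordering and tie-breaking are editorial assumptions.
def classify_trial(correct: bool, pvr: float, rcr: float, value_accuracy: float) -> str:
    if correct and pvr < 0.30:
        return "navigation-bypass success"    # right answer despite <30% of required pages
    if not correct and pvr >= 0.90:
        return "perfect-navigation failure"   # reached the pages, failed downstream
    if not correct and value_accuracy >= 0.80:
        return "near-miss"                    # intermediate values right, finish-line code wrong
    if pvr < 0.30 and rcr < 0.30:
        return "total failure"                # neither navigation nor tool use succeeded
    return "uncategorized"

# e.g. classify_trial(correct=False, pvr=0.95, rcr=0.69, value_accuracy=0.5)
#      -> "perfect-navigation failure"
```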