AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

Kasra Mazaheri; Parsa Mazaheri

arXiv: 2605.20530 · v2 · pith:CPERPJ27new · submitted 2026-05-19 · 💻 cs.AI · cs.CL · cs.LG · cs.SE


Pith reviewed 2026-05-21 06:31 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG · cs.SE

keywords LLM agents, evaluation benchmarks, trajectory diagnosis, control taxonomy, failure analysis, prompt supervision


The pith

Explicit control labels in prompts are essential for high-performing LLM agent evaluations

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language model agents operate in complex environments, but current benchmarks lean heavily on final task success rates. This paper develops a control-decision taxonomy with six states and a failure taxonomy with nine categories to analyze agent trajectories in finer detail. It compares evaluations in which models see these labels in the prompt against evaluations in which they do not. When the labels are removed, trajectory accuracy drops by 14-40 pp for every model tested and lands in a narrow 0.54-0.62 band. This suggests that many apparent agent capabilities depend on the specific supervision provided during testing.

Core claim

The paper claims that agent evaluations must move beyond single accuracy columns by using a six-state control taxonomy and a nine-category trajectory-failure taxonomy, and that a taxonomy-aware versus taxonomy-blind test reveals how much of measured performance comes from prompt supervision rather than intrinsic capability.

What carries the argument

The taxonomy-aware versus taxonomy-blind methodology that measures the contribution of explicit label menus to trajectory accuracy.
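
As a rough illustration of this contrast, the sketch below scores the same items under both prompt conditions and reports the gap in percentage points. The six state names come from the paper's abstract; the `Item` record, the toy predictions, and the assumption that taxonomy-blind answers have already been mapped onto the six labels are hypothetical and are not the authors' pipeline.

```python
from dataclasses import dataclass
from enum import Enum


class Control(Enum):
    """Six control states named in the paper's abstract."""
    ACT = "Act"
    ASK = "Ask"
    REFUSE = "Refuse"
    STOP = "Stop"
    CONFIRM = "Confirm"
    RECOVER = "Recover"


@dataclass(frozen=True)
class Item:
    """Hypothetical record: one gold label and one prediction per condition."""
    gold: Control
    pred_aware: Control  # prediction when the label menu is in the prompt
    pred_blind: Control  # free-form answer, assumed already mapped to a label


def accuracy(items, field):
    """Fraction of items whose prediction in `field` matches the gold label."""
    return sum(getattr(it, field) == it.gold for it in items) / len(items)


def drop_pp(items):
    """Aware-minus-blind accuracy gap, in percentage points."""
    return 100.0 * (accuracy(items, "pred_aware") - accuracy(items, "pred_blind"))


# Toy data for illustration only; not taken from the paper.
items = [
    Item(Control.ACT, Control.ACT, Control.ACT),
    Item(Control.ASK, Control.ASK, Control.ACT),
    Item(Control.REFUSE, Control.REFUSE, Control.REFUSE),
    Item(Control.RECOVER, Control.RECOVER, Control.STOP),
]
print(accuracy(items, "pred_aware"), accuracy(items, "pred_blind"), drop_pp(items))
```

Keeping the mapping step outside the scoring function makes it easier to see how much of the gap is due to the mapping rather than the model, which is the sensitivity the referee report raises.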

If this is right

Trajectory accuracy depends on the presence of explicit decision and failure labels.
No single model leads across control accuracy, diagnosis quality, and tool-context retention.
Existing agent benchmarks cover only a subset of the six behavioral axes identified.
Performance floors appear independent of model family when supervision is minimized.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could use these taxonomies to create more robust training objectives for agents.
The methodology might help identify which benchmarks are most informative for real-world deployment.
Uniform performance without labels points to shared architectural limits in current LLMs for autonomous operation.

Load-bearing premise

The six-state control taxonomy and nine-category failure taxonomy are complete and non-overlapping enough to classify behaviors across the fifteen benchmarks without major gaps.

What would settle it

Testing the eight models on new agent tasks outside the original fifteen benchmarks and checking whether accuracy still drops uniformly into the 0.54-0.62 range without labels.

Figures

Figures reproduced from arXiv: 2605.20530 by Kasra Mazaheri, Parsa Mazaheri.

**Figure 2.** τ-bench pass^k decay (Overall split, 2026 Sierra leaderboard snapshot). Eight submissions, one color per (model, reasoning). Claude Opus 4.5 wins at pass^1 (0.70) but Qwen3.5-397B-A17B wins at pass^4 (0.56). The GPT-5.2 reasoning-on vs. reasoning-off pair (+14 pp pass^1, +23 pp pass^4) shows the axis responds to interventions.
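
For readers unfamiliar with the metric in this caption: pass^k is usually read as the probability that all k independent trials of a task succeed, which is why it decays as k grows. Below is a minimal sketch of the common combinatorial estimator, C(c, k) / C(n, k) averaged over tasks. The per-task counts are invented for illustration and are not the leaderboard's data, and the function names are hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials drawn from `trials` all succeed."""
    if not 0 < k <= trials:
        raise ValueError("k must satisfy 0 < k <= trials")
    # comb(successes, k) is 0 when successes < k, so the estimate is 0 there.
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(counts, k: int) -> float:
    """Average the per-task estimate over (successes, trials) pairs."""
    return sum(pass_hat_k(c, n, k) for c, n in counts) / len(counts)


# Toy per-task (successes, trials) counts; illustration only.
toy = [(4, 4), (3, 4), (0, 4)]
for k in (1, 2, 4):
    print(k, round(mean_pass_hat_k(toy, k), 3))
```

A task that succeeds three times out of four contributes 0.75 at k = 1 but nothing at k = 4, so two systems can swap ranks between pass^1 and pass^4, as the caption describes.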

**Figure 3.** Coverage by axis. Each row aggregates the 15 audited benchmarks by their score on that axis (cobalt = …

**Figure 4.** Per-model radar grid over (control accuracy, trajectory label accuracy, tool-context utility retention) under …

Original abstract

Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but their evaluations often collapse behavior into final task success. AgentAtlas reframes agent evaluation as a diagnostic vocabulary and audit protocol for separating outcome success from control-decision quality and trajectory quality. The paper contributes: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a trajectory-failure vocabulary with primary error source and downstream impact; (iii) a 0/1/2 benchmark-coverage audit over fifteen agent benchmarks; and (iv) an illustrative protocol study on a synthetic 1,342-item set evaluated with eight models under taxonomy-aware and taxonomy-blind prompt formats. The synthetic demonstration is not a public benchmark release and should not be read as a definitive model comparison. Instead, it illustrates two measurement risks: mapped label agreement can change substantially when the explicit label menu is removed, and axis choice can change apparent rankings. AgentAtlas is intended to help benchmark designers state what behavior they cover, and to help evaluators diagnose failures that outcome-only leaderboards hide.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents AgentAtlas as a framework extending 2024-2025 work on LLM agent evaluation beyond single accuracy metrics. It introduces a six-state control-decision taxonomy (Act/Ask/Refuse/Stop/Confirm/Recover) and a nine-category trajectory-failure taxonomy (with primary_error_source and impact labels), a taxonomy-aware versus taxonomy-blind prompting methodology to isolate the effect of explicit supervision, and a benchmark-coverage audit across fifteen agent benchmarks. In a demonstration on a synthetic 1,342-item set evaluated with a fixed set of eight models (four closed, four open-weight), removing the explicit label menu from prompts drops trajectory accuracy by 14-40 pp to a 0.54-0.62 floor independent of model family, and no model dominates all three reported metrics (control accuracy, trajectory diagnosis, tool-context utility retention). The synthetic run is positioned as a measurement-protocol demonstration rather than a benchmark release.

Significance. If the taxonomies are shown to be robust, the work usefully demonstrates that much of current agent performance on benchmarks may derive from prompt supervision rather than intrinsic capability, producing a surprisingly tight performance floor once that supervision is removed. The taxonomy-aware/blind contrast and the coverage audit provide concrete tools for more diagnostic evaluation. The explicit framing as a protocol demonstration rather than a leaderboard is a strength that keeps the scope proportionate.

major comments (2)

[Taxonomy definitions and demonstration setup] The central claim of a model-family-independent accuracy floor (0.54-0.62) and the comparative statement that no model wins on all three metrics rest on reliable assignment of the 1,342 trajectories to the six control states and nine failure categories. The manuscript provides no inter-annotator agreement, coverage audit, or expert validation that the taxonomies are exhaustive and disjoint across the fifteen benchmarks (see the demonstration setup and taxonomy definitions). Without these, both the reported drop magnitudes and the cross-family invariance remain sensitive to label choice.
[Benchmark-coverage audit] The benchmark-coverage audit is described as mapping fifteen benchmarks against six behavioral axes, yet no quantitative summary (e.g., coverage percentages or gaps per axis) is supplied. This weakens the claim that the chosen taxonomies are broadly applicable for diagnosis.
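
The inter-annotator agreement requested in the first major comment is often reported as Cohen's kappa for two annotators. The sketch below is one plain-Python way to compute it. The two label sequences are invented for illustration, and nothing here reflects how the authors actually annotated the 1,342 items.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in ca.keys() | cb.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations over the six control states; illustration only.
ann1 = ["Act", "Ask", "Act", "Refuse", "Stop", "Act"]
ann2 = ["Act", "Ask", "Ask", "Refuse", "Stop", "Act"]
print(round(cohens_kappa(ann1, ann2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label such as Act dominates the item set.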

minor comments (2)

[Abstract] The abstract states the run is a 'demonstration' but could more explicitly note that the 1,342 items and eight-model set are not intended as a released benchmark or leaderboard.
[Taxonomy definitions] Notation for the two orthogonal hierarchical labels (primary_error_source, impact) in the nine-category taxonomy would benefit from an explicit example table showing how a single trajectory receives both labels.
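
A minimal sketch of what such an example table could look like follows. The field names `primary_error_source` and `impact` are the two label names quoted above; the record type, the trajectory identifier, and both label values are placeholders, since the paper's nine category names are not listed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrajectoryLabel:
    """Hypothetical record pairing one trajectory with both labels."""
    trajectory_id: str
    primary_error_source: str  # placeholder value, not the paper's category list
    impact: str                # placeholder value, not the paper's category list


def as_table(labels):
    """Render the labels as a plain-text table with one row per trajectory."""
    header = ("trajectory_id", "primary_error_source", "impact")
    rows = [header] + [
        (lab.trajectory_id, lab.primary_error_source, lab.impact) for lab in labels
    ]
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    return "\n".join(
        " | ".join(cell.ljust(width) for cell, width in zip(row, widths))
        for row in rows
    )


# Placeholder values for illustration only.
example = [TrajectoryLabel("traj-001", "wrong_tool_argument", "task_failed")]
print(as_table(example))
```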

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments on our manuscript. We address each of the major comments below and have made revisions to the manuscript where appropriate to strengthen the presentation of our taxonomies and audit.


Referee: [Taxonomy definitions and demonstration setup] The central claim of a model-family-independent accuracy floor (0.54-0.62) and the comparative statement that no model wins on all three metrics rest on reliable assignment of the 1,342 trajectories to the six control states and nine failure categories. The manuscript provides no inter-annotator agreement, coverage audit, or expert validation that the taxonomies are exhaustive and disjoint across the fifteen benchmarks (see the demonstration setup and taxonomy definitions). Without these, both the reported drop magnitudes and the cross-family invariance remain sensitive to label choice.

Authors: We agree that formal validation metrics such as inter-annotator agreement would enhance the reliability of the reported results. The trajectories were annotated by the authors using the provided taxonomy definitions, with iterative refinement to ensure consistency. However, we recognize this as a limitation of the current demonstration. We have revised the manuscript to include a detailed description of the annotation process in the demonstration setup section and added a note on the potential sensitivity to labeling choices. Additionally, we plan to incorporate a small IAA study in future extensions of this work. revision: partial

Referee: [Benchmark-coverage audit] The benchmark-coverage audit is described as mapping fifteen benchmarks against six behavioral axes, yet no quantitative summary (e.g., coverage percentages or gaps per axis) is supplied. This weakens the claim that the chosen taxonomies are broadly applicable for diagnosis.

Authors: We thank the referee for pointing out this omission. The benchmark-coverage audit was performed by systematically reviewing each of the fifteen benchmarks against the six behavioral axes defined in the taxonomy. We have now added a quantitative summary in the form of a table showing coverage percentages for each axis across the benchmarks, along with identified gaps. This revision provides concrete evidence supporting the applicability of the taxonomies. revision: yes
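
One way a per-axis summary of a 0/1/2 audit could be tabulated is sketched below. The benchmark names, axis names, and scores are placeholders rather than the paper's audit data, and the function is a hypothetical helper, not the authors' code.

```python
def coverage_summary(audit):
    """Share of benchmarks (in percent) at each score 0, 1, 2, per axis.

    `audit` maps benchmark name -> {axis name -> score in {0, 1, 2}}.
    """
    axes = sorted({axis for scores in audit.values() for axis in scores})
    summary = {}
    for axis in axes:
        # A KeyError here means a benchmark was not scored on this axis.
        scores = [bench_scores[axis] for bench_scores in audit.values()]
        if any(s not in (0, 1, 2) for s in scores):
            raise ValueError(f"scores for {axis!r} must be 0, 1, or 2")
        summary[axis] = {
            level: 100.0 * scores.count(level) / len(scores) for level in (0, 1, 2)
        }
    return summary


# Placeholder audit with four benchmarks and two axes; illustration only.
toy_audit = {
    "bench_1": {"axis_a": 2, "axis_b": 0},
    "bench_2": {"axis_a": 1, "axis_b": 0},
    "bench_3": {"axis_a": 2, "axis_b": 1},
    "bench_4": {"axis_a": 0, "axis_b": 0},
}
for axis, shares in coverage_summary(toy_audit).items():
    print(axis, shares)
```

Reporting the share at each score level, rather than a single mean, keeps an axis that is scored 0 by most benchmarks visible as a gap.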

Circularity Check

0 steps flagged

No circularity: empirical results from prompt variants on newly introduced taxonomies


The paper introduces the six-state control-decision taxonomy and nine-category failure taxonomy as new constructs rather than deriving them from prior equations or self-referential definitions. The central findings (14-40 pp accuracy drop to a 0.54-0.62 floor, and lack of a single dominating model) are obtained by directly running the eight models on 1,342 items under two explicit prompt conditions (taxonomy-aware with label menu vs. taxonomy-blind). No fitted parameters, predictions that reduce to inputs by construction, or load-bearing self-citations appear in the derivation chain. The benchmark-coverage audit and methodology are presented as measurement protocols applied to the generated trajectories, not as outputs forced by the inputs. This is a standard empirical demonstration with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests primarily on the assumption that the newly defined taxonomies are useful and reasonably exhaustive; no numerical parameters are fitted to produce the reported accuracy floors.

axioms (1)

domain assumption: The six-state control-decision taxonomy and nine-category failure taxonomy together capture the relevant behavioral distinctions for agent evaluation.
The methodology and benchmark audit depend on these taxonomies being adequate; the abstract presents them as extensions without external validation data.

invented entities (1)

Taxonomy-aware versus taxonomy-blind prompt methodology (no independent evidence)
purpose: To isolate the contribution of explicit supervision to measured agent performance.
This comparison is introduced by the paper to quantify prompt effects.

