A span-decomposed evaluation framework for AI agents achieves state-of-the-art results on GAIA and SWE-Bench with up to 3.5x gains in localization accuracy by breaking traces into independent per-span judgments.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.AI 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
GUIDE decomposes GUI agent evaluation into trajectory segmentation, subtask diagnosis, and overall summary to deliver higher accuracy and structured error reports than holistic baselines.
citing papers explorer
-
Holistic Evaluation and Failure Diagnosis of AI Agents
A span-decomposed evaluation framework for AI agents achieves state-of-the-art results on GAIA and SWE-Bench with up to 3.5x gains in localization accuracy by breaking traces into independent per-span judgments.
-
GUIDE: Interpretable GUI Agent Evaluation via Hierarchical Diagnosis
GUIDE decomposes GUI agent evaluation into trajectory segmentation, subtask diagnosis, and overall summary to deliver higher accuracy and structured error reports than holistic baselines.