Log analysis is necessary for credible evaluation of AI agents

2026 · cs.AI · arXiv 2605.08545

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Agent benchmarks typically report only final outcomes: pass or fail. This threatens evaluation credibility in three ways. First, scores may be inflated or deflated by shortcuts and benchmark artifacts, misrepresenting capability. Second, benchmark performance may fail to predict real-world utility due to scaffold limitations and recurring failure modes. Finally, capability scores may conceal dangerous or catastrophic actions taken by the agent. We argue that log analysis -- the systematic tracking and analysis of the inputs, execution, and outputs of an AI agent -- is necessary to overcome these validity threats and promote credible agent evaluation. In this paper, we (1) present a taxonomy of threats to credible evaluation documented through log analysis, and (2) develop a set of guiding principles for log analysis. We illustrate these principles on tau-Bench Airline, revealing that pass^5 performance was under-elicited by nearly 50% and surfacing deployment failure modes invisible to outcome metrics. We conclude with pragmatic recommendations to increase uptake of log analysis, directed at diverse stakeholders including benchmark creators, model developers, independent evaluators, and deployers.
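The pass^5 figure quoted in the abstract is a reliability metric: roughly, the chance that an agent solves the same task in all of k independent attempts. The abstract does not spell out how it is computed, so the following is only a minimal sketch, assuming the usual tau-Bench-style estimator, which averages C(c, k) / C(n, k) over tasks for c successes in n trials. The function names `pass_hat_k` and `benchmark_pass_hat_k` are hypothetical and do not come from the paper.

```python
from math import comb
from typing import Sequence


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the probability that k i.i.d. trials of one task all succeed.

    Assumed estimator (not taken from the paper): C(c, k) / C(n, k),
    where n is the number of trials run and c the number that passed.
    """
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must lie between 0 and num_trials")
    if not 1 <= k <= num_trials:
        raise ValueError("k must lie between 1 and num_trials")
    # comb(c, k) is 0 when c < k, so a task with fewer than k passes scores 0.
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(
    successes_per_task: Sequence[int], num_trials: int, k: int
) -> float:
    """Average the per-task estimate over every task in the benchmark."""
    if not successes_per_task:
        raise ValueError("need at least one task")
    per_task = [pass_hat_k(num_trials, c, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)
```

Under this definition a single failed trial out of five drives a task's pass^5 to zero, which is why an outcome-only score can shift sharply when log analysis finds that failures stem from the scaffold or the benchmark rather than from the agent.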

representative citing papers

Causal methods for LLM development and evaluation

cs.LG · 2026-05-25 · unverdicted · novelty 4.0

Position paper mapping causal inference opportunities across the LLM development pipeline from pretraining to evaluation to address confounding and non-stationarity.
