Log analysis is necessary for credible evaluation of AI agents

Arvind Narayanan; Conrad Stosz; Cozmin Ududec; Jacob Steinhardt; JJ Allaire; Magda Dubois; Marius Hobbhahn; Nitya Nadgir; Peter Kirgis; Sayash Kapoor

arxiv: 2605.08545 · v1 · submitted 2026-05-08 · 💻 cs.AI

Peter Kirgis, Sayash Kapoor, Stephan Rabanser, Nitya Nadgir, Cozmin Ududec, Magda Dubois, JJ Allaire, Conrad Stosz, Marius Hobbhahn, Jacob Steinhardt, Arvind Narayanan

Pith reviewed 2026-05-12 02:15 UTC · model grok-4.3

classification 💻 cs.AI

keywords AI agents · benchmark evaluation · log analysis · validity threats · agent benchmarks · evaluation credibility · shortcut detection

The pith

AI agent benchmarks that report only pass or fail outcomes produce misleading evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agent benchmarks today typically disclose only whether a task succeeded or failed. This narrow view can hide cases where agents take shortcuts, encounter recurring failures that would block real use, or perform unsafe actions that go unnoticed. The paper claims that log analysis, meaning the systematic recording and review of every input, step, and output an agent produces, is required to detect these problems and yield credible assessments. The authors supply a taxonomy of the resulting validity threats plus practical principles for applying log analysis, then demonstrate the approach on tau-Bench Airline, where log analysis showed that pass^5 performance had been under-elicited by nearly 50% and surfaced deployment failure modes that outcome metrics miss. Adopting the method would force evaluators to confront whether reported scores actually reflect reliable capability.

Core claim

The paper establishes that log analysis, the systematic tracking and analysis of the inputs, execution, and outputs of an AI agent, is necessary to overcome three validity threats in current agent benchmarks. These threats are: scores inflated or deflated by shortcuts and benchmark artifacts; performance that does not predict real-world utility because of scaffold limitations and recurring failure modes; and concealment of dangerous actions. The authors present a taxonomy of such threats documented through log analysis and set out guiding principles for its use, then apply them to tau-Bench Airline to show that pass^5 performance was under-elicited by nearly 50 percent and to surface failure modes invisible to outcome metrics.

What carries the argument

Log analysis, the systematic tracking and analysis of the inputs, execution, and outputs of an AI agent, which serves as the mechanism to surface validity threats that final outcomes alone conceal.
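
To make the definition concrete, here is a minimal sketch of what a per-run log record might look like, together with a check that the three components named in the definition (inputs, execution, outputs) are all present. The names `Step`, `TrajectoryLog`, and `missing_components` are hypothetical; they are not taken from the paper or from any evaluation harness.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One iteration of the agent's execution loop (hypothetical schema)."""
    action: str        # e.g. a tool call or a message sent to the user
    observation: str   # what the environment or user returned


@dataclass
class TrajectoryLog:
    """Everything a reviewer would need to inspect a single run."""
    task_id: str
    instructions: str                                  # inputs: prompt, task, policy
    steps: List[Step] = field(default_factory=list)    # execution trace
    outcome: Optional[bool] = None                     # outputs: benchmark pass/fail


def missing_components(log: TrajectoryLog) -> List[str]:
    """Return the names of trajectory components the log fails to capture."""
    missing = []
    if not log.instructions.strip():
        missing.append("instructions")
    if not log.steps:
        missing.append("actions")
    if log.outcome is None:
        missing.append("outcomes")
    return missing


if __name__ == "__main__":
    log = TrajectoryLog(task_id="demo-1", instructions="Rebook the customer's flight.")
    print(missing_components(log))  # the trace and the outcome are not recorded yet
```

A record like this keeps the pass/fail bit alongside the trace that produced it, so a reviewer can check whether a "pass" was reached by a shortcut or a "fail" was caused by the benchmark itself.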

If this is right

Benchmark creators must publish execution traces rather than final scores alone.
Model developers gain concrete data on recurring failure patterns that outcome metrics miss.
Independent evaluators can identify when reported capability rests on benchmark artifacts.
Deployers obtain evidence of unsafe or inefficient behaviors before production use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Benchmarks that ignore logs may systematically overstate progress toward deployable agents.
Automated log summarization tools could become a standard part of evaluation pipelines.
Safety reviews of agents would need to incorporate execution traces as a required input.

Load-bearing premise

The three validity threats apply generally to agent benchmarks and log analysis can be added at scale without creating new biases or prohibitive costs.

What would settle it

A controlled comparison in which agents evaluated only by pass/fail scores show the same real-world success rates and safety records as agents whose logs were also inspected would falsify the necessity claim.

Figures

Figures reproduced from arXiv: 2605.08545 by Arvind Narayanan, Conrad Stosz, Cozmin Ududec, Jacob Steinhardt, JJ Allaire, Magda Dubois, Marius Hobbhahn, Nitya Nadgir, Peter Kirgis, Sayash Kapoor, Stephan Rabanser.

**Figure 1.** Benchmark outcomes are useful insofar as they track capability (internal validity), that capability transfers to deployment (external validity), and the evaluation surfaces safety-relevant risks (safety evaluation). Log analysis verifies each link. This matters because benchmarks inform deployment decisions, which rest on a chain of inferences that binary outcomes cannot validate …

**Figure 2.** The log-analysis “sandwich”: inputs and outputs bracket an execution loop, color-coded by …

**Figure 3.** Illustrations of internal and external validity issues on …

Original abstract

Agent benchmarks typically report only final outcomes: pass or fail. This threatens evaluation credibility in three ways. First, scores may be inflated or deflated by shortcuts and benchmark artifacts, misrepresenting capability. Second, benchmark performance may fail to predict real-world utility due to scaffold limitations and recurring failure modes. Finally, capability scores may conceal dangerous or catastrophic actions taken by the agent. We argue that log analysis -- the systematic tracking and analysis of the inputs, execution, and outputs of an AI agent -- is necessary to overcome these validity threats and promote credible agent evaluation. In this paper, we (1) present a taxonomy of threats to credible evaluation documented through log analysis, and (2) develop a set of guiding principles for log analysis. We illustrate these principles on tau-Bench Airline, revealing that pass^5 performance was under-elicited by nearly 50% and surfacing deployment failure modes invisible to outcome metrics. We conclude with pragmatic recommendations to increase uptake of log analysis, directed at diverse stakeholders including benchmark creators, model developers, independent evaluators, and deployers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Log analysis catches real problems in agent evals that pass/fail scores miss, but the necessity claim needs broader evidence beyond one benchmark.

The main point is that AI agent benchmarks relying only on final outcomes can give misleading pictures of capability. The authors identify three validity threats from this practice and push for log analysis (tracking inputs, steps, and outputs) as a way to make evaluations more trustworthy.

What stands out is their taxonomy of the threats and the set of guiding principles for doing log analysis. These feel like a useful structure that wasn't laid out before. The tau-Bench Airline example works well here: it demonstrates how pass^5 scores were under-elicited by nearly 50% and brings to light deployment failures that outcome metrics alone don't catch. That's the kind of detail that helps the argument stick.

The soft spots come from the limited scope. They use one concrete case to illustrate, but don't provide evidence that the same issues appear across a range of agent benchmarks. The necessity of log analysis follows only if these threats are common and if logging doesn't bring its own problems like high costs or selective analysis. The paper mentions pragmatic recommendations for uptake, but without data on overhead, it's not clear how feasible it is at scale.

This paper is for anyone involved in creating or using benchmarks for AI agents, including developers and independent evaluators. It flags an issue that affects how we judge progress in the area. I would send it to peer review. The ideas are worth refining through feedback, and the example gives it a solid foundation even if more validation would help.

Referee Report

3 major / 2 minor

Summary. The manuscript argues that AI agent benchmarks relying solely on final outcome metrics (pass/fail) introduce three validity threats: inflated/deflated scores from shortcuts and artifacts, poor prediction of real-world utility due to scaffold limitations and recurring failures, and concealment of dangerous actions. It claims that systematic log analysis—tracking inputs, execution traces, and outputs—is necessary to mitigate these threats, presents a taxonomy of such threats and a set of guiding principles for log analysis, and demonstrates the approach via an illustration on tau-Bench Airline where pass^5 performance was under-elicited by nearly 50% with hidden deployment failures revealed.

Significance. If the necessity claim and principles hold, the work could meaningfully improve the credibility of agent evaluations by reducing over-optimism and undetected risks, with the tau-Bench case offering a concrete, reproducible example of how outcome metrics alone can mislead. Adoption by benchmark creators and evaluators could lead to more robust practices in the field.

major comments (3)

[Abstract, §1, §5] The central necessity claim (abstract, §1, and conclusion) that log analysis is required across agent benchmarks because outcome-only reporting inherently creates the three threats is load-bearing but rests on a single tau-Bench illustration (§5); no systematic survey or analysis of other benchmarks (e.g., WebArena or AgentBench) is provided to establish generality rather than benchmark-specific issues.
[§3 (taxonomy), §4 (principles), §5 (illustration)] The feasibility component of the necessity argument lacks quantification of logging/analysis overhead, potential new biases (e.g., selective trace selection or interpretation subjectivity), or scalability limits, which is required to show that the proposed taxonomy and principles (§3–4) are net beneficial and preferable to alternatives.
[§1, §6 (recommendations)] The paper does not compare log analysis against alternative mitigations such as richer outcome metrics with built-in checks or improved benchmark designs that reduce artifacts by construction; without this, the claim that log analysis is necessary (rather than one viable option) remains incompletely supported.

minor comments (2)

[§2, §5] Clarify the exact definition and computation of pass^5 and related metrics early in the paper (e.g., via a table or dedicated subsection) to aid readers new to agent evaluation.
[Conclusion] The pragmatic recommendations in the conclusion could include more concrete implementation steps or pseudocode for the guiding principles to increase uptake.
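
The first minor comment asks for a definition of pass^5, which the text reproduced here never gives. As a hedged illustration, the sketch below assumes the definition usually attributed to the τ-bench benchmark: pass^k is the probability that all k independent trials of a task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The function name `pass_hat_k` and the example counts are hypothetical and are not results from the paper.

```python
from math import comb
from typing import Sequence


def pass_hat_k(successes_per_task: Sequence[int], n_trials: int, k: int) -> float:
    """Estimate pass^k: the chance that k trials of a task ALL succeed.

    successes_per_task[i] is the number of successful trials (out of
    n_trials) for task i. This follows the assumed tau-bench-style
    estimator C(c, k) / C(n, k), averaged across tasks.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    for c in successes_per_task:
        if not 0 <= c <= n_trials:
            raise ValueError("success count outside [0, n_trials]")
    # comb(c, k) is 0 whenever c < k, so tasks with too few successes add 0.
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


if __name__ == "__main__":
    # Made-up counts: three tasks, five trials each.
    counts = [5, 4, 0]
    print("pass^1:", pass_hat_k(counts, n_trials=5, k=1))
    print("pass^5:", pass_hat_k(counts, n_trials=5, k=5))
```

Under this definition a single failed trial drives a task's pass^5 contribution to zero when n = k = 5, which is why a handful of benchmark-side errors can depress the metric far more than they depress the average pass rate.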

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which help clarify the scope and support for our central claims. We address each major comment point by point below, with planned revisions indicated where the manuscript will be updated to strengthen the presentation.

Referee: [Abstract, §1, §5] The central necessity claim (abstract, §1, and conclusion) that log analysis is required across agent benchmarks because outcome-only reporting inherently creates the three threats is load-bearing but rests on a single tau-Bench illustration (§5); no systematic survey or analysis of other benchmarks (e.g., WebArena or AgentBench) is provided to establish generality rather than benchmark-specific issues.

Authors: We acknowledge that the detailed case study is limited to tau-Bench Airline. The taxonomy in §3 is nevertheless framed as general, drawing on patterns observed across the agent evaluation literature (including shortcut behaviors documented in WebArena and recurring scaffold failures in AgentBench). The necessity argument follows from the logical structure of the three validity threats, which we contend arise whenever only final outcomes are reported. To better demonstrate generality, we will add a new subsection (likely in §3) that explicitly maps each threat category to examples from WebArena, AgentBench, and related benchmarks, citing existing reports of these issues. This revision will clarify applicability without expanding the paper into a full systematic survey, which lies outside the current scope. revision: partial
Referee: [§3 (taxonomy), §4 (principles), §5 (illustration)] The feasibility component of the necessity argument lacks quantification of logging/analysis overhead, potential new biases (e.g., selective trace selection or interpretation subjectivity), or scalability limits, which is required to show that the proposed taxonomy and principles (§3–4) are net beneficial and preferable to alternatives.

Authors: The referee is correct that the manuscript does not quantify overhead, discuss introduced biases, or address scalability limits. Our emphasis was on establishing necessity and articulating principles rather than a complete cost-benefit analysis. In the revised version we will insert a concise feasibility discussion in §4 that (a) provides rough estimates of logging overhead in terms of additional tokens and storage for typical agent traces, (b) acknowledges risks such as selective reporting or interpretive subjectivity and proposes mitigation via automated tooling and inter-rater protocols, and (c) notes scalability considerations for large benchmark suites. These additions will make explicit that the principles are intended to be lightweight and that the benefits for credibility outweigh the modest costs. revision: yes
Referee: [§1, §6 (recommendations)] The paper does not compare log analysis against alternative mitigations such as richer outcome metrics with built-in checks or improved benchmark designs that reduce artifacts by construction; without this, the claim that log analysis is necessary (rather than one viable option) remains incompletely supported.

Authors: We position log analysis as a necessary complement to outcome metrics for addressing the three specific threats, not as the sole possible intervention. The manuscript does not, however, contain an explicit side-by-side comparison with richer outcome metrics or artifact-reduced benchmark designs. In the revision we will expand the recommendations in §6 with a short comparative paragraph. We will argue that while richer metrics and improved designs can reduce certain artifacts, they do not fully resolve concealment of dangerous actions or detection of recurring scaffold failures, both of which require inspection of execution traces. Log analysis can be combined with these alternatives, and we will note this integration as a practical path forward. revision: partial

Circularity Check

0 steps flagged

No circularity: argumentative position with no derivations or self-referential reductions.

The paper advances an argumentative claim that outcome-only reporting in agent benchmarks creates three validity threats and that log analysis is therefore necessary for credible evaluation. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The taxonomy of threats is presented as documented through log analysis and illustrated via a single benchmark example (tau-Bench), but this does not reduce the necessity claim to a self-definition or fitted input by construction. There are no self-citations invoked as load-bearing uniqueness theorems, no ansatzes smuggled via prior work, and no renaming of known results. The central assertion remains an externally grounded opinion about evaluation practices rather than a closed logical loop equivalent to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on domain assumptions about what constitutes credible evaluation rather than on new mathematical constructs or data fits.

axioms (1)

domain assumption: Outcome-only metrics are insufficient for credible agent evaluation. Invoked throughout the abstract as the premise motivating log analysis.

pith-pipeline@v0.9.0 · 5522 in / 1083 out tokens · 33769 ms · 2026-05-12T02:15:49.239489+00:00 · methodology

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 1 internal anchor

[1] Accessed: 2025-01-18. OpenAI. Introducing operator. https://openai.com/index/introducing-operator/, 2025a. Accessed: 2025-01-18. OpenAI. Function calling. OpenAI Platform Documentation, 2025b. URL https://platform.openai.com/docs/guides/function-calling. Accessed: 2025-01-20. Pan, Y., Kong, D., Zhou, S., Cui, C., Leng, Y., Jiang, B., Liu, H., Shang, Y ...

[2] Define a validity target. Choose whether the goal is to improve evaluation fidelity, predict real-world performance, or surface safety-critical actions. The target determines the burden of proof.

[3] Confirm log coverage. Verify that the harness captures all trajectory components needed for the target. Missing context (instructions, actions, outcomes) is the most common blocker.

[4] Build and validate a rubric. Start with a general question, read transcripts, and iteratively narrow to specific conditions. Validate on a held-out set with human review.

[5] Link labels to outcomes. Compute prevalence by outcome, then risk ratios. A failure mode that only appears in already-failed tasks doesn’t threaten score validity. B τ-Bench Validation Comparisons. This section compares the results from multiple approaches for validating τ-Bench Airline. The goal of this analysis is to establish a set of tasks with low inter...

[6] Our manual validation effort

[7] The manual validation of Amazon AGI (Cuadron et al., 2025)

[8] GPT-5 (medium) tested using Docent

[9] Claude Sonnet 4.5 tested using Docent. B.1 Automated Validation Rubric. The following rubric was provided to GPT-5 and Sonnet 4.5 for automated validation of τ-Bench Airline tasks. Each model was given the full transcript, policy text, and answer key for a failed run and asked to determine whether the failure reflected a genuine agent error or a benchmark s...

[10] Check benchmark outcome. If the run is not marked as a failure, label as “no match” and stop.

[11] Check agent compliance with written policy. Read the policy text and examine the agent’s actions. If the agent clearly violates an explicit requirement or prohibition, label as “no match” and stop. Minor deviations where the policy is ambiguous are treated as compliant.

[12] Check whether the agent reasonably follows the user’s instructions. Compare the agent’s final outcome against the user’s stated requirements. If the agent ignores or contradicts core requirements in a way that cannot be attributed to ambiguity, label as “no match” and stop.

[13] Look for benchmark specification or answer-key issues. Determine whether the failure is attributable to one or more of: (a) Answer key conflicts with policy: expected actions require behavior stricter than or contradicting the written policy. (b) Answer key conflicts with environment results: expected actions depend on database or tool results that do not ...

[14] Final decision. Label as “match” (benchmark issue) if and only if: the run is marked as a failure, the agent complies with policy, the agent reasonably satisfies user requirements, and the failure is best explained by a benchmark specification issue from step 4. Models were required to provide explanations citing specific transcript evidence for their labe...

[15] Write the explanation. In 1–4 sentences, justify the label by citing where the relevant policy is stated, citing the user’s persuasive or exception-seeking messages (if any), and citing the agent actions or tool calls that either respect or violate the policy. C Detailed Threats to Evaluation Credibility. This appendix expands on the threats surveyed in S...
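
The "Link labels to outcomes" step in entry [5] can be sketched as a small calculation. The excerpt does not say which risk ratio is intended; the sketch assumes the conventional one, P(fail | label present) / P(fail | label absent). The function names and the counts in the example are hypothetical and are not data from the paper.

```python
from typing import Dict, Iterable, Optional, Tuple

Run = Tuple[bool, bool]  # (passed, has_label), one entry per benchmark run


def prevalence_by_outcome(runs: Iterable[Run]) -> Dict[str, Optional[float]]:
    """Share of passed runs and of failed runs that carry the label."""
    counts = {True: [0, 0], False: [0, 0]}  # passed -> [labelled, total]
    for passed, has_label in runs:
        counts[passed][0] += int(has_label)
        counts[passed][1] += 1
    return {
        name: (counts[key][0] / counts[key][1] if counts[key][1] else None)
        for name, key in (("pass", True), ("fail", False))
    }


def risk_ratio(runs: Iterable[Run]) -> Optional[float]:
    """P(fail | label) / P(fail | no label); None when undefined."""
    fails = {True: 0, False: 0}   # has_label -> number of failed runs
    totals = {True: 0, False: 0}  # has_label -> number of runs
    for passed, has_label in runs:
        totals[has_label] += 1
        fails[has_label] += int(not passed)
    if totals[True] == 0 or totals[False] == 0 or fails[False] == 0:
        return None
    return (fails[True] / totals[True]) / (fails[False] / totals[False])


if __name__ == "__main__":
    # Made-up counts: 8 labelled runs (6 failed), 12 unlabelled runs (3 failed).
    runs = ([(False, True)] * 6 + [(True, True)] * 2
            + [(False, False)] * 3 + [(True, False)] * 9)
    print(prevalence_by_outcome(runs))
    print(risk_ratio(runs))
```

Reporting prevalence separately for passed and failed runs is what lets a reader see whether a labelled behavior also shows up in runs the benchmark counted as successes.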

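
The validation rubric quoted in entries [10] to [14] is a short-circuiting decision procedure, which can be sketched as follows. In the paper the four judgements are made by a human or model reading the transcript; here they are simply boolean inputs. The `Judgement` class, its field names, and `label_run` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    """Reviewer's answers for one run (hypothetical field names)."""
    marked_failure: bool              # step 1: benchmark scored the run as failed
    violates_policy: bool             # step 2: clear breach of the written policy
    ignores_user_requirements: bool   # step 3: core user requirements not met
    benchmark_issue: bool             # step 4: spec or answer-key problem found


def label_run(j: Judgement) -> str:
    """Return "match" (benchmark issue) or "no match", stopping early as the rubric does."""
    if not j.marked_failure:
        return "no match"   # step 1: only failed runs are candidates
    if j.violates_policy:
        return "no match"   # step 2: a genuine agent error
    if j.ignores_user_requirements:
        return "no match"   # step 3: a genuine agent error
    # Steps 4 and 5: "match" only if a benchmark issue best explains the failure.
    return "match" if j.benchmark_issue else "no match"


if __name__ == "__main__":
    example = Judgement(
        marked_failure=True,
        violates_policy=False,
        ignores_user_requirements=False,
        benchmark_issue=True,
    )
    print(label_run(example))
```

The early exits mirror the "label as no match and stop" wording, so a run is attributed to the benchmark only after agent-side explanations have been ruled out.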