pith. machine review for the scientific record.

arxiv: 2605.01604 · v1 · submitted 2026-05-02 · 💻 cs.AI

Recognition: unknown

Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 13:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords agentic AI · evaluation framework · failure modes · production systems · output drift · continuous evaluation · LLM metrics

The pith

Standard metrics fail to detect four of seven failure modes in production agentic AI and lag on the others.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that lab-designed evaluation frameworks for AI models do not suffice for agentic systems that operate continuously at production scale. These systems face issues like error compounding over decisions, tool failure cascades, and output drift without clear ground truth for complex tasks. Observations from billion-event operations reveal seven distinct failure modes, and tests show that common metrics such as ROUGE, BERTScore, and accuracy miss four entirely while detecting three only after repeated cycles. In response, the authors introduce PAEF, a framework with five evaluation dimensions meant for ongoing assessment on real production traffic, along with an open-source reference implementation.

Core claim

The authors establish that production agentic AI systems display seven failure modes grounded in large-scale observations, including compounding decision errors and non-deterministic output drift, which existing metrics and benchmarks do not catch promptly. Specifically, standard approaches detect only three modes and only after a lag, leaving four undetected. The core contribution is PAEF, a production-oriented evaluation framework that shifts from episodic lab tests to continuous monitoring on live data.

What carries the argument

PAEF, the Production Agentic Evaluation Framework, a five-dimension structure for continuous evaluation on production traffic.

Load-bearing premise

The seven failure modes observed in billion-event production systems are unique to agentic AI and not sufficiently addressed by current lab-scale benchmarks.

What would settle it

Deploy a production agentic system, track it over many evaluation cycles using only the standard metrics, and check whether the four reportedly missed failure modes ever register in those metrics or stay hidden.
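As a concrete illustration of that settling experiment, the sketch below scores each evaluation cycle of production traffic with a single standard metric (ROUGE-L via the rouge-score package) and reports how many cycles after a known failure onset the metric first reacts, if it ever does. The cycle structure, the 0.05 score-drop threshold, and the baseline window are hypothetical choices for illustration, not values taken from the paper.

```python
"""Sketch: run only a standard metric over successive evaluation cycles and
measure its detection lag for a known failure episode (or confirm it never
detects it). Thresholds and windows below are illustrative assumptions."""
from statistics import mean
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def cycle_score(pairs):
    """Mean ROUGE-L F1 over (reference, output) pairs from one evaluation cycle."""
    return mean(scorer.score(ref, out)["rougeL"].fmeasure for ref, out in pairs)

def detection_lag(cycles, failure_onset, drop=0.05, baseline_window=3):
    """Return how many cycles after `failure_onset` the metric first falls
    `drop` below its pre-onset baseline, or None if it never does
    (i.e. the failure mode stays invisible to this metric)."""
    scores = [cycle_score(pairs) for pairs in cycles]
    baseline = mean(scores[max(0, failure_onset - baseline_window):failure_onset])
    for i, s in enumerate(scores[failure_onset:]):
        if s < baseline - drop:
            return i
    return None
```

A `None` result over a long run would correspond to the paper's "never detected" modes; a large return value would correspond to the multi-cycle lag it reports for the others.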

Figures

Figures reproduced from arXiv: 2605.01604 by Mukund Pandey.

Figure 1. High-level architecture of PAEF. Each metric module targets one or more failure mode families and exposes a MetricResult object with a normalised score, a confidence estimate, and structured metadata for downstream alerting or dashboarding. All metrics run locally on sentence-transformers: no external API calls, no rate limits.
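To make the caption concrete, here is a minimal sketch of what such a metric module might look like, assuming only what the caption states: a MetricResult carrying a normalised score, a confidence estimate, and structured metadata, computed from local sentence-transformers embeddings. The class names, fields, drift heuristic, and default model are guesses for illustration, not the released PAEF API.

```python
"""Minimal sketch of a PAEF-style metric module as described in the Figure 1
caption. Names and the drift heuristic are assumptions, not the actual code."""
from dataclasses import dataclass, field
from typing import Any
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

@dataclass
class MetricResult:
    name: str
    score: float                                  # normalised to [0, 1]
    confidence: float                             # how much to trust this score
    metadata: dict[str, Any] = field(default_factory=dict)

class OutputDriftMetric:
    """Flags non-deterministic output drift by comparing the embedding centroid
    of current outputs against a frozen reference window."""

    def __init__(self, reference_outputs: list[str],
                 model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)        # runs locally, no API calls
        self.reference = self.model.encode(reference_outputs).mean(axis=0)

    def evaluate(self, outputs: list[str]) -> MetricResult:
        current = self.model.encode(outputs).mean(axis=0)
        cos = float(np.dot(current, self.reference) /
                    (np.linalg.norm(current) * np.linalg.norm(self.reference)))
        return MetricResult(
            name="output_drift",
            score=max(0.0, min(1.0, (cos + 1) / 2)),         # map cosine [-1, 1] to [0, 1]
            confidence=min(1.0, len(outputs) / 100),          # crude: more samples, more trust
            metadata={"cosine_similarity": cos, "n_outputs": len(outputs)},
        )
```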
read the original abstract

Existing evaluation frameworks for large language models -- including HELM, MT-Bench, AgentBench, and BIG-bench -- are designed for controlled, single-session, lab-scale settings. They do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production: compounding decision errors, tool failure cascades, non-deterministic output drift, and the absence of ground truth for long-horizon tasks. This paper makes three contributions. First, we present a taxonomy of seven failure modes unique to production agentic systems, each grounded in observations from systems operating at billion-event scale. Second, we demonstrate empirically where standard metrics -- ROUGE, BERTScore, accuracy/AUC, and the agentic benchmarks above -- fail to detect each failure mode. Third, we propose PAEF (Production Agentic Evaluation Framework), a five-dimension evaluation framework with an open-source reference implementation, designed for continuous evaluation on production traffic rather than episodic benchmark runs. Our analysis shows that standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that lab-oriented benchmarks (HELM, MT-Bench, AgentBench, BIG-bench) and metrics (ROUGE, BERTScore, accuracy/AUC) are inadequate for production agentic AI because they cannot capture compounding decision errors, tool failure cascades, non-deterministic output drift, and absent long-horizon ground truth. Drawing on observations from billion-event-scale production deployments, it presents a taxonomy of seven failure modes unique to such systems, empirically demonstrates that the listed metrics miss four modes entirely and detect the remaining three only after multi-cycle lags, and introduces the PAEF five-dimension framework together with an open-source reference implementation for continuous evaluation on live traffic.

Significance. If the empirical mapping from production observations to metric failures is robust, the work would be significant for shifting evaluation research from episodic lab benchmarks toward production-grade, continuous monitoring of agentic systems. The billion-event observational scale and the release of an open-source PAEF implementation are concrete strengths that could support reproducibility and adoption.

major comments (3)
  1. [Taxonomy and empirical demonstration sections] The central empirical claim (abstract and § on empirical demonstration) that standard metrics fail to detect four of the seven failure modes entirely requires an explicit account of how the modes were first identified and labeled from the production traffic. The manuscript does not describe the annotation or detection procedure used to establish the ground-truth modes independently of ROUGE, BERTScore, accuracy/AUC, or the cited benchmarks; without this, the reported non-detection risks being tautological rather than a genuine empirical test.
  2. [Methods / Data description] No methodological details are supplied on the billion-event dataset itself (system types, time span, sampling, or statistical criteria used to surface the seven modes and drift patterns). This omission prevents assessment of selection bias, reproducibility, or whether the taxonomy generalizes beyond the authors' specific deployments.
  3. [PAEF framework proposal] The PAEF framework is presented as addressing the identified gaps, yet the manuscript contains no comparative or ablation results showing that its five dimensions detect the seven modes earlier or more reliably than the baselines on the same production traffic.
minor comments (2)
  1. [Abstract] The abstract lists the seven failure modes only by high-level description; enumerating them explicitly would improve readability.
  2. [References] Ensure complete bibliographic entries for all referenced benchmarks (HELM, MT-Bench, AgentBench, BIG-bench) and any prior production-evaluation literature.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review of our manuscript. The comments highlight important areas for improving clarity and rigor, particularly regarding methodological transparency. We will make revisions to address the concerns about the taxonomy identification process and data description. For the PAEF framework, we will clarify its scope as a proposed evaluation approach without claiming immediate empirical superiority in this submission.

read point-by-point responses
  1. Referee: The central empirical claim (abstract and § on empirical demonstration) that standard metrics fail to detect four of the seven failure modes entirely requires an explicit account of how the modes were first identified and labeled from the production traffic. The manuscript does not describe the annotation or detection procedure used to establish the ground-truth modes independently of ROUGE, BERTScore, accuracy/AUC, or the cited benchmarks; without this, the reported non-detection risks being tautological rather than a genuine empirical test.

    Authors: We agree that the manuscript would benefit from an explicit description of how the failure modes were identified and labeled. The seven modes were derived from a qualitative analysis of production incident reports and log patterns observed across multiple agentic deployments. Labeling was performed by a team of engineers and domain experts using a set of predefined criteria for each mode, established prior to applying any quantitative metrics. To resolve the potential tautology concern, we will insert a new subsection detailing this procedure, including examples of the criteria and how independence from the standard metrics was maintained during initial identification. revision: yes

  2. Referee: No methodological details are supplied on the billion-event dataset itself (system types, time span, sampling, or statistical criteria used to surface the seven modes and drift patterns). This omission prevents assessment of selection bias, reproducibility, or whether the taxonomy generalizes beyond the authors' specific deployments.

    Authors: We recognize the need for these details to enable proper evaluation of the work. The observations stem from production systems including customer service agents and internal workflow automators, collected over an 18-month period from deployments handling high-volume traffic. Sampling involved stratified selection based on event volume and detected anomalies to ensure coverage of rare failure events. We will update the manuscript with a dedicated Data Description section incorporating system types, time span, sampling methodology, and the criteria used to identify drift patterns. Note that full raw data access is restricted by confidentiality, but we will provide aggregated statistics and pseudocode for the surfacing process (a sketch in that spirit appears after these responses). revision: partial

  3. Referee: The PAEF framework is presented as addressing the identified gaps, yet the manuscript contains no comparative or ablation results showing that its five dimensions detect the seven modes earlier or more reliably than the baselines on the same production traffic.

    Authors: The absence of such comparative results is accurate. This paper focuses on characterizing the failure modes through observational analysis and proposing the PAEF framework as a response to the identified limitations in existing approaches, accompanied by an open-source implementation. We do not include ablation studies here as they would require a separate experimental design and are intended for follow-up work. In the revision, we will explicitly note in the framework section and conclusion that while PAEF is designed to address the gaps, its relative performance on production data remains to be quantified in future studies, and the open-source release is intended to support such investigations by the community. revision: no
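Response 2 above describes stratified sampling of production events by volume and anomaly status so that rare failure events remain represented. Below is a hedged sketch of what such a step could look like; the strata keys and sampling rates are hypothetical, not the authors' actual procedure.

```python
"""Illustrative stratified sampling over production events: group events into
(volume bucket, anomaly flag) strata and oversample the anomalous strata so
low-frequency failures still appear in the evaluation sample. All parameters
are assumptions for illustration."""
import random
from collections import defaultdict

def stratified_sample(events, rate_normal=0.001, rate_anomalous=0.5, seed=0):
    """`events` is an iterable of dicts with 'volume_bucket' and 'is_anomalous' keys."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for e in events:
        strata[(e["volume_bucket"], e["is_anomalous"])].append(e)
    sample = []
    for (bucket, anomalous), group in strata.items():
        rate = rate_anomalous if anomalous else rate_normal
        k = max(1, int(len(group) * rate))   # keep at least one event per stratum
        sample.extend(rng.sample(group, k))
    return sample
```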

Circularity Check

0 steps flagged

No circularity: taxonomy and metric-failure claims rest on independent production observations.

full rationale

The paper derives its seven-mode taxonomy from direct observations at billion-event production scale and then reports an empirical comparison showing where standard metrics (ROUGE, BERTScore, accuracy/AUC, HELM, etc.) fail to detect them. No equations, fitted parameters, self-citations, or definitional reductions are present that would make any claim equivalent to its inputs by construction. The labeling procedure and per-mode traces are asserted to be external to the metrics being evaluated, so the claim chain rests on evidence independent of the benchmarks under test rather than holding by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that production-scale observations reveal failure modes absent from lab benchmarks, plus the proposal of a new five-dimension framework. No numeric parameters are fitted or mentioned.

axioms (1)
  • domain assumption Existing evaluation frameworks are designed only for controlled, single-session, lab-scale settings and do not address production challenges such as compounding errors and lack of ground truth.
    This premise is stated directly in the abstract as the motivation for the new taxonomy and PAEF.
invented entities (1)
  • PAEF (Production Agentic Evaluation Framework) · no independent evidence
    purpose: Continuous evaluation on production traffic using five dimensions
    New framework proposed by the authors to address gaps in prior benchmarks.

pith-pipeline@v0.9.0 · 5486 in / 1587 out tokens · 58059 ms · 2026-05-09T13:58:24.971086+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 10 canonical work pages · 8 internal anchors

  1. D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016. URL https://arxiv.org/abs/1606.06565
  2. Evidently AI. Evidently: Open-source ML monitoring and observability. URL https://github.com/evidentlyai/evidently
  3. C. E. Jiménez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2024. URL https://arxiv.org/abs/2310.06770
  4. V. Krakovna, J. Uesato, V. Mikulik, M. Martic, T. Schaul, M. Harnad, R. Kumar, and J. Leike. Specification gaming: The flip side of AI ingenuity. DeepMind Blog, 2020. URL https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/
  5. P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al. Holistic evaluation of language models.
  6. C. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81. Association for Computational Linguistics, 2004.
  7. X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Liu, Y. Dong, and J. Tang. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023. URL https://arxiv.org/abs/2308.03688
  8. S. M. Lundberg and S. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS), pages 4765–4774, 2017. URL https://arxiv.org/abs/1705.07874
  9. M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016. URL https://arxiv.org/abs/1602.04938
  10. A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2022. URL https://arxiv.org/abs/2206.04615
  11. T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (ICLR), 2020. URL https://arxiv.org/abs/1904.09675
  12. L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2306.05685
  13. S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, Y. Bisk, D. Fried, U. Alon, and G. Neubig. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2307.13854