pith. machine review for the scientific record.

arxiv: 2605.01604 · v1 · submitted 2026-05-02 · 💻 cs.AI

Recognition: unknown

Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 13:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords agentic AI · evaluation framework · failure modes · production systems · output drift · continuous evaluation · LLM metrics

The pith

Standard metrics fail to detect four of seven failure modes in production agentic AI and lag on the others.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that lab-designed evaluation frameworks for AI models do not suffice for agentic systems that operate continuously at production scale. These systems face issues like error compounding over decisions, tool failure cascades, and output drift without clear ground truth for complex tasks. Observations from billion-event operations reveal seven distinct failure modes, and tests show that common metrics such as ROUGE, BERTScore, and accuracy miss four entirely while detecting three only after repeated cycles. In response, the authors introduce PAEF, a framework with five evaluation dimensions meant for ongoing assessment on real production traffic, along with an open-source reference implementation.

Core claim

The authors establish that production agentic AI systems display seven failure modes grounded in large-scale observations, including compounding decision errors and non-deterministic output drift, which existing metrics and benchmarks do not catch promptly. Specifically, standard approaches detect only three modes and only after a lag, leaving four undetected. The core contribution is PAEF, a production-oriented evaluation framework that shifts from episodic lab tests to continuous monitoring on live data.

What carries the argument

PAEF, the Production Agentic Evaluation Framework, a five-dimension structure for continuous evaluation on production traffic.

Load-bearing premise

The seven failure modes observed in billion-event production systems are unique to agentic AI and not sufficiently addressed by current lab-scale benchmarks.

What would settle it

Deploy a production agentic system, track it over many evaluation cycles using only the standard metrics, and check whether the four reportedly missed failure modes ever register in those metrics or stay hidden.
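As a concrete illustration of that settling experiment, the sketch below scores each evaluation cycle of production traffic with a single standard metric (ROUGE-L via the rouge-score package) and reports how many cycles after a known failure onset the metric first reacts, if it ever does. The cycle structure, the 0.05 score-drop threshold, and the baseline window are hypothetical choices for illustration, not values taken from the paper.

```python
"""Sketch: run only a standard metric over successive evaluation cycles and
measure its detection lag for a known failure episode (or confirm it never
detects it). Thresholds and windows below are illustrative assumptions."""
from statistics import mean
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def cycle_score(pairs):
    """Mean ROUGE-L F1 over (reference, output) pairs from one evaluation cycle."""
    return mean(scorer.score(ref, out)["rougeL"].fmeasure for ref, out in pairs)

def detection_lag(cycles, failure_onset, drop=0.05, baseline_window=3):
    """Return how many cycles after `failure_onset` the metric first falls
    `drop` below its pre-onset baseline, or None if it never does
    (i.e. the failure mode stays invisible to this metric)."""
    scores = [cycle_score(pairs) for pairs in cycles]
    baseline = mean(scores[max(0, failure_onset - baseline_window):failure_onset])
    for i, s in enumerate(scores[failure_onset:]):
        if s < baseline - drop:
            return i
    return None
```

A `None` result over a long run would correspond to the paper's "never detected" modes; a large return value would correspond to the multi-cycle lag it reports for the others.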

Figures

Figures reproduced from arXiv: 2605.01604 by Mukund Pandey.

Figure 1. High-level architecture of PAEF. Each metric module targets one or more failure mode families and exposes a MetricResult object with a normalised score, a confidence estimate, and structured metadata for downstream alerting or dashboarding. All metrics run locally on sentence-transformers: no external API calls, no rate limits.
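To make the caption concrete, here is a minimal sketch of what such a metric module might look like, assuming only what the caption states: a MetricResult carrying a normalised score, a confidence estimate, and structured metadata, computed from local sentence-transformers embeddings. The class names, fields, drift heuristic, and default model are guesses for illustration, not the released PAEF API.

```python
"""Minimal sketch of a PAEF-style metric module as described in the Figure 1
caption. Names and the drift heuristic are assumptions, not the actual code."""
from dataclasses import dataclass, field
from typing import Any
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

@dataclass
class MetricResult:
    name: str
    score: float                                  # normalised to [0, 1]
    confidence: float                             # how much to trust this score
    metadata: dict[str, Any] = field(default_factory=dict)

class OutputDriftMetric:
    """Flags non-deterministic output drift by comparing the embedding centroid
    of current outputs against a frozen reference window."""

    def __init__(self, reference_outputs: list[str],
                 model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)        # runs locally, no API calls
        self.reference = self.model.encode(reference_outputs).mean(axis=0)

    def evaluate(self, outputs: list[str]) -> MetricResult:
        current = self.model.encode(outputs).mean(axis=0)
        cos = float(np.dot(current, self.reference) /
                    (np.linalg.norm(current) * np.linalg.norm(self.reference)))
        return MetricResult(
            name="output_drift",
            score=max(0.0, min(1.0, (cos + 1) / 2)),         # map cosine [-1, 1] to [0, 1]
            confidence=min(1.0, len(outputs) / 100),          # crude: more samples, more trust
            metadata={"cosine_similarity": cos, "n_outputs": len(outputs)},
        )
```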
read the original abstract

Existing evaluation frameworks for large language models -- including HELM, MT-Bench, AgentBench, and BIG-bench -- are designed for controlled, single-session, lab-scale settings. They do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production: compounding decision errors, tool failure cascades, non-deterministic output drift, and the absence of ground truth for long-horizon tasks. This paper makes three contributions. First, we present a taxonomy of seven failure modes unique to production agentic systems, each grounded in observations from systems operating at billion-event scale. Second, we demonstrate empirically where standard metrics -- ROUGE, BERTScore, accuracy/AUC, and the agentic benchmarks above -- fail to detect each failure mode. Third, we propose PAEF (Production Agentic Evaluation Framework), a five-dimension evaluation framework with an open-source reference implementation, designed for continuous evaluation on production traffic rather than episodic benchmark runs. Our analysis shows that standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that lab-oriented benchmarks (HELM, MT-Bench, AgentBench, BIG-bench) and metrics (ROUGE, BERTScore, accuracy/AUC) are inadequate for production agentic AI because they cannot capture compounding decision errors, tool failure cascades, non-deterministic output drift, and absent long-horizon ground truth. Drawing on observations from billion-event-scale production deployments, it presents a taxonomy of seven failure modes unique to such systems, empirically demonstrates that the listed metrics miss four modes entirely and detect the remaining three only after multi-cycle lags, and introduces the PAEF five-dimension framework together with an open-source reference implementation for continuous evaluation on live traffic.

Significance. If the empirical mapping from production observations to metric failures is robust, the work would be significant for shifting evaluation research from episodic lab benchmarks toward production-grade, continuous monitoring of agentic systems. The billion-event observational scale and the release of an open-source PAEF implementation are concrete strengths that could support reproducibility and adoption.

major comments (3)
  1. [Taxonomy and empirical demonstration sections] The central empirical claim (abstract and § on empirical demonstration) that standard metrics fail to detect four of the seven failure modes entirely requires an explicit account of how the modes were first identified and labeled from the production traffic. The manuscript does not describe the annotation or detection procedure used to establish the ground-truth modes independently of ROUGE, BERTScore, accuracy/AUC, or the cited benchmarks; without this, the reported non-detection risks being tautological rather than a genuine empirical test.
  2. [Methods / Data description] No methodological details are supplied on the billion-event dataset itself (system types, time span, sampling, or statistical criteria used to surface the seven modes and drift patterns). This omission prevents assessment of selection bias, reproducibility, or whether the taxonomy generalizes beyond the authors' specific deployments.
  3. [PAEF framework proposal] The PAEF framework is presented as addressing the identified gaps, yet the manuscript contains no comparative or ablation results showing that its five dimensions detect the seven modes earlier or more reliably than the baselines on the same production traffic.
minor comments (2)
  1. [Abstract] The abstract lists the seven failure modes only by high-level description; enumerating them explicitly would improve readability.
  2. [References] Ensure complete bibliographic entries for all referenced benchmarks (HELM, MT-Bench, AgentBench, BIG-bench) and any prior production-evaluation literature.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review of our manuscript. The comments highlight important areas for improving clarity and rigor, particularly regarding methodological transparency. We will make revisions to address the concerns about the taxonomy identification process and data description. For the PAEF framework, we will clarify its scope as a proposed evaluation approach without claiming immediate empirical superiority in this submission.

read point-by-point responses
  1. Referee: The central empirical claim (abstract and § on empirical demonstration) that standard metrics fail to detect four of the seven failure modes entirely requires an explicit account of how the modes were first identified and labeled from the production traffic. The manuscript does not describe the annotation or detection procedure used to establish the ground-truth modes independently of ROUGE, BERTScore, accuracy/AUC, or the cited benchmarks; without this, the reported non-detection risks being tautological rather than a genuine empirical test.

    Authors: We agree that the manuscript would benefit from an explicit description of how the failure modes were identified and labeled. The seven modes were derived from a qualitative analysis of production incident reports and log patterns observed across multiple agentic deployments. Labeling was performed by a team of engineers and domain experts using a set of predefined criteria for each mode, established prior to applying any quantitative metrics. To resolve the potential tautology concern, we will insert a new subsection detailing this procedure, including examples of the criteria and how independence from the standard metrics was maintained during initial identification. revision: yes

  2. Referee: No methodological details are supplied on the billion-event dataset itself (system types, time span, sampling, or statistical criteria used to surface the seven modes and drift patterns). This omission prevents assessment of selection bias, reproducibility, or whether the taxonomy generalizes beyond the authors' specific deployments.

    Authors: We recognize the need for these details to enable proper evaluation of the work. The observations stem from production systems including customer service agents and internal workflow automators, collected over an 18-month period from deployments handling high-volume traffic. Sampling involved stratified selection based on event volume and detected anomalies to ensure coverage of rare failure events. We will update the manuscript with a dedicated Data Description section incorporating system types, time span, sampling methodology, and the criteria used to identify drift patterns. Note that full raw data access is restricted by confidentiality, but we will provide aggregated statistics and pseudocode for the surfacing process (a sketch in that spirit appears after these responses). revision: partial

  3. Referee: The PAEF framework is presented as addressing the identified gaps, yet the manuscript contains no comparative or ablation results showing that its five dimensions detect the seven modes earlier or more reliably than the baselines on the same production traffic.

    Authors: The absence of such comparative results is accurate. This paper focuses on characterizing the failure modes through observational analysis and proposing the PAEF framework as a response to the identified limitations in existing approaches, accompanied by an open-source implementation. We do not include ablation studies here as they would require a separate experimental design and are intended for follow-up work. In the revision, we will explicitly note in the framework section and conclusion that while PAEF is designed to address the gaps, its relative performance on production data remains to be quantified in future studies, and the open-source release is intended to support such investigations by the community. revision: no
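Response 2 above describes stratified sampling of production events by volume and anomaly status so that rare failure events remain represented. Below is a hedged sketch of what such a step could look like; the strata keys and sampling rates are hypothetical, not the authors' actual procedure.

```python
"""Illustrative stratified sampling over production events: group events into
(volume bucket, anomaly flag) strata and oversample the anomalous strata so
low-frequency failures still appear in the evaluation sample. All parameters
are assumptions for illustration."""
import random
from collections import defaultdict

def stratified_sample(events, rate_normal=0.001, rate_anomalous=0.5, seed=0):
    """`events` is an iterable of dicts with 'volume_bucket' and 'is_anomalous' keys."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for e in events:
        strata[(e["volume_bucket"], e["is_anomalous"])].append(e)
    sample = []
    for (bucket, anomalous), group in strata.items():
        rate = rate_anomalous if anomalous else rate_normal
        k = max(1, int(len(group) * rate))   # keep at least one event per stratum
        sample.extend(rng.sample(group, k))
    return sample
```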

Circularity Check

0 steps flagged

No circularity: taxonomy and metric-failure claims rest on independent production observations.

full rationale

The paper derives its seven-mode taxonomy from direct observations at billion-event production scale and then reports an empirical comparison showing where standard metrics (ROUGE, BERTScore, accuracy/AUC, HELM, etc.) fail to detect them. No equations, fitted parameters, self-citations, or definitional reductions are present that would make any claim equivalent to its inputs by construction. The labeling procedure and per-mode traces are asserted to be external to the metrics being evaluated, so the claim chain rests on evidence independent of the benchmarks under test rather than holding by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that production-scale observations reveal failure modes absent from lab benchmarks, plus the proposal of a new five-dimension framework. No numeric parameters are fitted or mentioned.

axioms (1)
  • domain assumption Existing evaluation frameworks are designed only for controlled, single-session, lab-scale settings and do not address production challenges such as compounding errors and lack of ground truth.
    This premise is stated directly in the abstract as the motivation for the new taxonomy and PAEF.
invented entities (1)
  • PAEF (Production Agentic Evaluation Framework) · no independent evidence
    purpose: Continuous evaluation on production traffic using five dimensions
    New framework proposed by the authors to address gaps in prior benchmarks.

pith-pipeline@v0.9.0 · 5486 in / 1587 out tokens · 58059 ms · 2026-05-09T13:58:24.971086+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

16 extracted references · 10 canonical work pages · 8 internal anchors

  1. D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016. URL https://arxiv.org/abs/1606.06565
  2. Evidently AI. Evidently: Open-source ML monitoring and observability. URL https://github.com/evidentlyai/evidently
  3. C. E. Jiménez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2024. URL https://arxiv.org/abs/2310.06770
  4. V. Krakovna, J. Uesato, V. Mikulik, M. Martic, T. Schaul, M. Harnad, R. Kumar, and J. Leike. Specification gaming: The flip side of AI ingenuity. DeepMind Blog, 2020. URL https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/
  5. P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al. Holistic evaluation of language models.
  6. C. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81. Association for Computational Linguistics, 2004.
  7. X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Liu, Y. Dong, and J. Tang. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023. URL https://arxiv.org/abs/2308.03688
  8. S. M. Lundberg and S. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS), pages 4765–4774, 2017. URL https://arxiv.org/abs/1705.07874
  9. M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016. URL https://arxiv.org/abs/1602.04938
  10. A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2022. URL https://arxiv.org/abs/2206.04615
  11. T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (ICLR), 2020. URL https://arxiv.org/abs/1904.09675
  12. L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2306.05685
  13. S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, Y. Bisk, D. Fried, U. Alon, and G. Neubig. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2307.13854