Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework
Pith reviewed 2026-05-09 13:58 UTC · model grok-4.3
The pith
Standard metrics miss four of the seven failure modes in production agentic AI entirely and detect the remaining three only after a multi-cycle lag.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that production agentic AI systems exhibit seven failure modes, grounded in large-scale observations and including compounding decision errors and non-deterministic output drift, that existing metrics and benchmarks do not catch promptly. Specifically, standard approaches detect only three of the modes, and only after a lag, leaving the other four undetected. The core contribution is PAEF, a production-oriented evaluation framework that shifts from episodic lab tests to continuous monitoring on live data.
What carries the argument
PAEF, the Production Agentic Evaluation Framework, a five-dimension structure for continuous evaluation on production traffic.
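This page does not enumerate PAEF's five dimensions, so the sketch below shows only the stated shape of the framework: a continuous evaluator that scores rolling windows of production traffic along named dimensions instead of running episodic benchmarks. The dimension, field, and scorer names are hypothetical placeholders, not PAEF's actual definitions.

```python
# Minimal sketch of a continuous, multi-dimension evaluator in the spirit
# of PAEF. The dimension names are hypothetical placeholders; the paper's
# actual five dimensions are not enumerated on this page.
from dataclasses import dataclass, field
from typing import Callable, Mapping, Sequence

# One "event" is whatever the production system logs for an agent step,
# e.g. {"tool_ok": True, "latency_s": 1.2, "output": "..."}.
Event = Mapping[str, object]
Scorer = Callable[[Sequence[Event]], float]  # window of events -> score in [0, 1]

@dataclass
class ContinuousEvaluator:
    scorers: Mapping[str, Scorer]                       # one scorer per dimension
    history: dict[str, list[float]] = field(default_factory=dict)

    def evaluate_window(self, window: Sequence[Event]) -> dict[str, float]:
        """Score one evaluation cycle's traffic along every dimension."""
        scores = {name: fn(window) for name, fn in self.scorers.items()}
        for name, s in scores.items():
            self.history.setdefault(name, []).append(s)
        return scores

# A placeholder scorer (an assumption, not PAEF's definition):
def tool_health(window: Sequence[Event]) -> float:
    calls = [e for e in window if "tool_ok" in e]
    return sum(bool(e["tool_ok"]) for e in calls) / max(len(calls), 1)

evaluator = ContinuousEvaluator(scorers={"tool_health": tool_health})
print(evaluator.evaluate_window([{"tool_ok": True}, {"tool_ok": False}]))
# {'tool_health': 0.5}
```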
Load-bearing premise
The seven failure modes observed in billion-event production systems are unique to agentic AI and not sufficiently addressed by current lab-scale benchmarks.
What would settle it
Deploy a production agentic system, track its performance over many evaluation cycles using only standard metrics, and determine whether the four reportedly missed failure modes surface in those metrics or remain hidden.
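A minimal sketch of that settling experiment, assuming each evaluation cycle yields one score from a standard metric (the metric itself is left abstract, and the scores below are invented): flag a cycle when its score leaves a rolling baseline band, then measure how long after a known failure onset the first flag fires. If a labeled mode is present but no flag ever fires, the metric missed it.

```python
# Sketch of the settling experiment: monitor one standard metric per
# evaluation cycle and check whether known failure-mode onsets ever trip it.
# `metric_scores` would come from ROUGE/BERTScore/accuracy runs on each
# cycle's traffic; the values below are made up for illustration.
import statistics

def detection_cycles(scores: list[float], warmup: int = 5, k: float = 3.0) -> list[int]:
    """Return cycle indices where the score leaves a rolling k-sigma band."""
    flagged = []
    for i in range(warmup, len(scores)):
        baseline = scores[:i]
        mu = statistics.fmean(baseline)
        sigma = statistics.stdev(baseline) or 1e-9  # guard against zero spread
        if abs(scores[i] - mu) > k * sigma:
            flagged.append(i)
    return flagged

metric_scores = [0.81, 0.80, 0.82, 0.81, 0.80, 0.81, 0.79, 0.62, 0.60, 0.59]
failure_onset_cycle = 6  # ground truth from incident labeling (hypothetical)
flags = detection_cycles(metric_scores)
lag = min((c for c in flags if c >= failure_onset_cycle), default=None)
print(flags, "lag:", None if lag is None else lag - failure_onset_cycle)
# [7] lag: 1  -- the metric notices one cycle after onset
```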
Original abstract
Existing evaluation frameworks for large language models -- including HELM, MT-Bench, AgentBench, and BIG-bench -- are designed for controlled, single-session, lab-scale settings. They do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production: compounding decision errors, tool failure cascades, non-deterministic output drift, and the absence of ground truth for long-horizon tasks. This paper makes three contributions. First, we present a taxonomy of seven failure modes unique to production agentic systems, each grounded in observations from systems operating at billion-event scale. Second, we demonstrate empirically where standard metrics -- ROUGE, BERTScore, accuracy/AUC, and the agentic benchmarks above -- fail to detect each failure mode. Third, we propose PAEF (Production Agentic Evaluation Framework), a five-dimension evaluation framework with an open-source reference implementation, designed for continuous evaluation on production traffic rather than episodic benchmark runs. Our analysis shows that standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that lab-oriented benchmarks (HELM, MT-Bench, AgentBench, BIG-bench) and metrics (ROUGE, BERTScore, accuracy/AUC) are inadequate for production agentic AI because they cannot capture compounding decision errors, tool failure cascades, non-deterministic output drift, and absent long-horizon ground truth. Drawing on observations from billion-event-scale production deployments, it presents a taxonomy of seven failure modes unique to such systems, empirically demonstrates that the listed metrics miss four modes entirely and detect the remaining three only after multi-cycle lags, and introduces the PAEF five-dimension framework together with an open-source reference implementation for continuous evaluation on live traffic.
Significance. If the empirical mapping from production observations to metric failures is robust, the work would be significant for shifting evaluation research from episodic lab benchmarks toward production-grade, continuous monitoring of agentic systems. The billion-event observational scale and the release of an open-source PAEF implementation are concrete strengths that could support reproducibility and adoption.
major comments (3)
- [Taxonomy and empirical demonstration sections] The central empirical claim (abstract and § on empirical demonstration) that standard metrics fail to detect four of the seven failure modes entirely requires an explicit account of how the modes were first identified and labeled from the production traffic. The manuscript does not describe the annotation or detection procedure used to establish the ground-truth modes independently of ROUGE, BERTScore, accuracy/AUC, or the cited benchmarks; without this, the reported non-detection risks being tautological rather than a genuine empirical test.
- [Methods / Data description] No methodological details are supplied on the billion-event dataset itself (system types, time span, sampling, or statistical criteria used to surface the seven modes and drift patterns). This omission prevents assessment of selection bias, reproducibility, or whether the taxonomy generalizes beyond the authors' specific deployments.
- [PAEF framework proposal] The PAEF framework is presented as addressing the identified gaps, yet the manuscript contains no comparative or ablation results showing that its five dimensions detect the seven modes earlier or more reliably than the baselines on the same production traffic.
minor comments (2)
- [Abstract] The abstract lists the seven failure modes only by high-level description; enumerating them explicitly would improve readability.
- [References] Ensure complete bibliographic entries for all referenced benchmarks (HELM, MT-Bench, AgentBench, BIG-bench) and any prior production-evaluation literature.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed review of our manuscript. The comments highlight important areas for improving clarity and rigor, particularly regarding methodological transparency. We will make revisions to address the concerns about the taxonomy identification process and data description. For the PAEF framework, we will clarify its scope as a proposed evaluation approach without claiming immediate empirical superiority in this submission.
Point-by-point responses
Referee: The central empirical claim (abstract and § on empirical demonstration) that standard metrics fail to detect four of the seven failure modes entirely requires an explicit account of how the modes were first identified and labeled from the production traffic. The manuscript does not describe the annotation or detection procedure used to establish the ground-truth modes independently of ROUGE, BERTScore, accuracy/AUC, or the cited benchmarks; without this, the reported non-detection risks being tautological rather than a genuine empirical test.
Authors: We agree that the manuscript would benefit from an explicit description of how the failure modes were identified and labeled. The seven modes were derived from a qualitative analysis of production incident reports and log patterns observed across multiple agentic deployments. Labeling was performed by a team of engineers and domain experts using a set of predefined criteria for each mode, established prior to applying any quantitative metrics. To resolve the potential tautology concern, we will insert a new subsection detailing this procedure, including examples of the criteria and how independence from the standard metrics was maintained during initial identification.
Revision: yes
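A minimal sketch of the metric-independent labeling the response describes, under stated assumptions: each mode gets a predefined predicate over raw log events (both predicates below are invented examples, not the paper's criteria), and annotator agreement is checked before any ROUGE/BERTScore comparison is consulted.

```python
# Sketch of metric-independent failure-mode labeling. Both criteria below
# are invented examples; the paper's predefined criteria are not reproduced.
from typing import Callable, Mapping, Sequence

Incident = Sequence[Mapping[str, object]]  # ordered log events for one incident

# Predefined criteria: predicates over raw logs only, no quantitative metrics.
CRITERIA: dict[str, Callable[[Incident], bool]] = {
    "tool_failure_cascade": lambda inc: sum(not e.get("tool_ok", True) for e in inc) >= 3,
    "compounding_error": lambda inc: any(e.get("retries", 0) > 2 for e in inc),
}

def label(incident: Incident) -> set[str]:
    """Apply every written criterion; an incident may show several modes."""
    return {mode for mode, pred in CRITERIA.items() if pred(incident)}

def percent_agreement(a: list[set[str]], b: list[set[str]]) -> float:
    """Share of incidents on which two annotators assigned identical mode sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```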
Referee: No methodological details are supplied on the billion-event dataset itself (system types, time span, sampling, or statistical criteria used to surface the seven modes and drift patterns). This omission prevents assessment of selection bias, reproducibility, or whether the taxonomy generalizes beyond the authors' specific deployments.
Authors: We recognize the need for these details to enable proper evaluation of the work. The observations stem from production systems including customer service agents and internal workflow automators, collected over an 18-month period from deployments handling high-volume traffic. Sampling involved stratified selection based on event volume and detected anomalies to ensure coverage of rare failure events. We will update the manuscript with a dedicated Data Description section incorporating system types, time span, sampling methodology, and the criteria used to identify drift patterns. Note that full raw data access is restricted by confidentiality, but we will provide aggregated statistics and pseudocode for the surfacing process.
Revision: partial
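A minimal sketch of that stratified selection, assuming each event record carries a volume bucket and an anomaly flag (both field names hypothetical): sample a fixed number of events per stratum, so rare failure events are not swamped by high-volume normal traffic.

```python
# Sketch of stratified sampling over production events. Field names
# ("volume_bucket", "anomalous") are hypothetical, not the paper's schema.
import random
from collections import defaultdict

def stratified_sample(events: list[dict], per_stratum: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    strata: dict[tuple, list[dict]] = defaultdict(list)
    for e in events:
        strata[(e["volume_bucket"], e["anomalous"])].append(e)
    sample = []
    for group in strata.values():
        # Keep every event from rare strata, a fixed-size sample from large ones.
        sample.extend(group if len(group) <= per_stratum else rng.sample(group, per_stratum))
    return sample
```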
Referee: The PAEF framework is presented as addressing the identified gaps, yet the manuscript contains no comparative or ablation results showing that its five dimensions detect the seven modes earlier or more reliably than the baselines on the same production traffic.
Authors: The absence of such comparative results is accurate. This paper focuses on characterizing the failure modes through observational analysis and proposing the PAEF framework as a response to the identified limitations in existing approaches, accompanied by an open-source implementation. We do not include ablation studies here as they would require a separate experimental design and are intended for follow-up work. In the revision, we will explicitly note in the framework section and conclusion that while PAEF is designed to address the gaps, its relative performance on production data remains to be quantified in future studies, and the open-source release is intended to support such investigations by the community.
Revision: no
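Once both detectors run on the same labeled traffic, the comparison the referee asks for reduces to detection lag per mode. A minimal sketch, with detector internals abstracted to the first cycle at which each alarms; all numbers below are hypothetical, not results from the paper.

```python
# Sketch of the missing head-to-head: detection lag, in evaluation cycles,
# of two detectors on the same labeled traffic. Detector internals are
# abstracted to "first cycle at which it alarmed" (None = never detected).
from typing import Optional

def detection_lag(onset_cycle: int, first_alarm: Optional[int]) -> Optional[int]:
    """Lag in cycles; None means never detected (alarms before onset are
    treated as false positives, not detections, in this simplification)."""
    if first_alarm is None or first_alarm < onset_cycle:
        return None
    return first_alarm - onset_cycle

# Hypothetical per-mode results: {mode: (ground-truth onset, baseline alarm, PAEF alarm)}
results = {
    "output_drift": (12, 18, 13),
    "tool_failure_cascade": (30, None, 31),
}
for mode, (onset, base, paef) in results.items():
    print(mode, "baseline lag:", detection_lag(onset, base),
          "PAEF lag:", detection_lag(onset, paef))
```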
Circularity Check
No circularity: taxonomy and metric-failure claims rest on independent production observations.
full rationale
The paper derives its seven-mode taxonomy from direct observations at billion-event production scale and then reports an empirical comparison showing where standard metrics (ROUGE, BERTScore, accuracy/AUC, HELM, etc.) fail to detect them. No equations, fitted parameters, self-citations, or definitional reductions are present that would make any claim equivalent to its inputs by construction. The labeling procedure and per-mode traces are asserted to be independent of the metrics under evaluation, so the chain of evidence does not rest on the benchmarks it critiques.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Existing evaluation frameworks are designed only for controlled, single-session, lab-scale settings and do not address production challenges such as compounding errors and the lack of ground truth.
invented entities (1)
- PAEF (Production Agentic Evaluation Framework): no independent evidence
Reference graph
Works this paper leans on
[1] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016. URL https://arxiv.org/abs/1606.06565.
[2] Evidently AI. Evidently: Open-source ML monitoring and observability. URL https://github.com/evidentlyai/evidently.
[3] C. E. Jiménez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770, 2024. URL https://arxiv.org/abs/2310.06770.
[4] V. Krakovna, J. Uesato, V. Mikulik, M. Martic, T. Schaul, M. Harnad, R. Kumar, and J. Leike. Specification gaming: The flip side of AI ingenuity. DeepMind Blog, 2020. URL https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/.
[5] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.
[6] C. Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81. Association for Computational Linguistics, 2004.
[7] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Liu, Y. Dong, and J. Tang. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023. URL https://arxiv.org/abs/2308.03688.
[8] S. M. Lundberg and S. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (NeurIPS), pages 4765–4774, 2017. URL https://arxiv.org/abs/1705.07874.
[9] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.
[10] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2022. URL https://arxiv.org/abs/2206.04615.
[11] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (ICLR), 2020. URL https://arxiv.org/abs/1904.09675.
[12] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2306.05685.
[13] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, Y. Bisk, D. Fried, U. Alon, and G. Neubig. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2307.13854.
discussion (0)