Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes
Pith reviewed 2026-05-15 14:39 UTC · model grok-4.3
The pith
An empirical study of 385 faults from 40 agentic AI repositories yields a taxonomy of 34 fault types organized into four architectural dimensions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our analysis produced a taxonomy of 34 fault types, organized into four architectural dimensions. These faults manifested as failures in structured-output interpretation, tool calls, runtime execution, and exception handling, with root causes including data schema mismatches, dependency drift, state management complexity, and model interface instability. Association rules showed recurring cross-component propagation linking structured data, dependency, and state management faults to their symptoms and root causes.
What carries the argument
A taxonomy of 34 fault types derived via grounded theory from sampled issues and pull requests, organized into four architectural dimensions of agentic AI systems.
Load-bearing premise
The 385 sampled faults drawn from 40 repositories via stratified sampling are representative of the full population of faults that occur in agentic AI systems in practice.
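The sample size of 385 matches the minimum given by Cochran's formula for estimating a proportion at 95% confidence with a 5% margin of error; the abstract does not say this is how the number was chosen, but the match suggests it. A minimal check:

```python
import math

def cochran_sample_size(z=1.96, p=0.5, e=0.05):
    """Minimum sample size for estimating a proportion (infinite-population form).

    z: z-score for the confidence level (1.96 for 95%)
    p: assumed proportion (0.5 maximizes required n)
    e: margin of error
    """
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

print(cochran_sample_size())  # → 385
```

With the finite-population correction for N = 13,602 the requirement drops slightly (to roughly 374), so 385 is, if anything, conservative under these assumptions.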
What would settle it
Finding a substantial set of faults in deployed agentic AI systems that match none of the 34 types or fall outside the four architectural dimensions would show the taxonomy is incomplete.
read the original abstract
Agentic AI systems combine LLM-based reasoning, orchestration, tool invocation, and interaction with external environments. These systems introduce faults that are difficult to characterize using existing taxonomies. To address this gap, we present an empirical study of faults in agentic AI systems. We collected 13,602 issues and pull requests from 40 repositories and, using stratified sampling, selected 385 faults for analysis. Through grounded theory, we derived taxonomies of fault types, symptoms, and root causes. We then used Apriori-based association rule mining to identify relationships among faults, symptoms, and root causes, and validated the taxonomy through a developer study with 145 practitioners. Our analysis produced a taxonomy of 34 fault types, organized into four architectural dimensions. These faults manifested as failures in structured-output interpretation, tool calls, runtime execution, and exception handling, with root causes including data schema mismatches, dependency drift, state management complexity, and model interface instability. Furthermore, association rules showed recurring cross-component propagation, linking structured data, dependency, and state management faults to their symptoms and root causes. Practitioners considered the taxonomy representative of agentic AI failures and suggested refinements related to multi-agent coordination and observability. These findings provide an empirical basis for diagnosing faults and improving reliability in agentic AI systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical study collecting 13,602 issues and pull requests from 40 GitHub repositories on agentic AI systems. Using stratified sampling, the authors analyze 385 faults via grounded theory to derive taxonomies of 34 fault types, symptoms, and root causes organized into four architectural dimensions. They apply Apriori association rule mining to identify relationships and validate the taxonomy with a survey of 145 practitioners. The central claim is that this taxonomy characterizes faults in structured-output interpretation, tool calls, runtime execution, and exception handling, with root causes such as data schema mismatches, dependency drift, state management complexity, and model interface instability.
Significance. If the derived taxonomy generalizes, it would offer a useful empirical foundation for diagnosing and mitigating faults specific to agentic AI architectures that existing software engineering taxonomies do not adequately cover. The integration of grounded theory with Apriori mining and practitioner validation adds practical value, and the explicit identification of cross-component propagation patterns is a constructive contribution to reliability engineering for LLM-orchestrated systems.
major comments (3)
- [Methods] Methods section (data collection and sampling): The generality of the four architectural dimensions and 34 fault types rests on the claim that the 40 repositories and stratified sample of 385 faults are representative of agentic AI systems in practice. No details are given on repository selection criteria (e.g., stars, activity thresholds, framework diversity), stratification variables, or any external benchmark against production logs or closed-source systems, leaving the sampling coverage unverified.
- [Analysis] Analysis section (grounded theory): The derivation of fault types, symptoms, and root causes lacks any reported inter-rater reliability metric (e.g., Cohen's kappa or percentage agreement) or description of how coding disagreements were resolved. Without these, the stability of the 34-type taxonomy cannot be assessed.
- [Results] Results section (Apriori mining): The association rules linking faults, symptoms, and root causes are presented without the support, confidence, or lift thresholds used, or any statistical significance testing. This makes it impossible to evaluate whether the reported recurring cross-component propagations are robust or artifacts of the sample.
minor comments (2)
- [Abstract] The abstract lists four architectural dimensions but does not name them explicitly; adding the names (e.g., reasoning, orchestration, tool-use, environment interaction) would improve clarity.
- [Validation] The practitioner validation is described only at a high level; reporting response rate, demographic breakdown, or specific agreement percentages with the taxonomy would strengthen the validation claim.
Simulated Author's Rebuttal
We thank the referee for the insightful and constructive comments on our manuscript. We address each major comment point by point below, explaining our position and the changes we will make in the revised version.
read point-by-point responses
-
Referee: [Methods] Methods section (data collection and sampling): The generality of the four architectural dimensions and 34 fault types rests on the claim that the 40 repositories and stratified sample of 385 faults are representative of agentic AI systems in practice. No details are given on repository selection criteria (e.g., stars, activity thresholds, framework diversity), stratification variables, or any external benchmark against production logs or closed-source systems, leaving the sampling coverage unverified.
Authors: We agree that explicit details on repository selection and sampling strategy are required to support claims of representativeness. In the revised Methods section we will add a dedicated subsection specifying the selection criteria (minimum 500 GitHub stars, commits within the prior 12 months, and coverage across major frameworks such as LangChain, AutoGen, and LlamaIndex), the stratification variables (repository activity tier and primary fault category), and a limitations paragraph acknowledging that direct benchmarking against closed-source production logs was not feasible. These additions will allow readers to evaluate the sampling coverage for themselves. revision: yes
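The proportional stratified sampling described above can be sketched as follows. The stratum variable ("tier") and the record layout are illustrative assumptions, not the authors' actual protocol:

```python
import random
from collections import defaultdict

# Hypothetical fault records tagged with one stratification variable
# (repository activity tier); the paper's real stratification scheme
# may use different or additional variables.
rng = random.Random(42)
faults = [{"id": i, "tier": rng.choice(["high", "medium", "low"])}
          for i in range(13602)]

def stratified_sample(records, key, n_total, seed=0):
    """Proportional allocation: each stratum contributes a share of the
    total sample equal to its share of the population."""
    sampler = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[r[key]].append(r)
    picked = []
    for members in strata.values():
        n = round(n_total * len(members) / len(records))
        picked.extend(sampler.sample(members, min(n, len(members))))
    return picked

sample = stratified_sample(faults, "tier", 385)
```

Per-stratum rounding means the final sample can land a record or two off the 385 target; a real protocol would state how such remainders are allocated.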
-
Referee: [Analysis] Analysis section (grounded theory): The derivation of fault types, symptoms, and root causes lacks any reported inter-rater reliability metric (e.g., Cohen's kappa or percentage agreement) or description of how coding disagreements were resolved. Without these, the stability of the 34-type taxonomy cannot be assessed.
Authors: We will expand the Analysis section to include inter-rater reliability metrics and a description of the disagreement-resolution process. Two authors independently coded an overlapping sample of faults; we will report the resulting agreement statistic and explain that remaining disagreements were resolved through structured discussion meetings until consensus was reached, with a third author available for arbitration. This information will enable readers to assess the stability of the derived taxonomy. revision: yes
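The agreement statistic the authors commit to reporting can be computed with nothing beyond the standard library. A minimal sketch of Cohen's kappa over two raters' codings; the category labels are invented for illustration:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters
    who coded the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Expected agreement if each rater assigned labels independently
    # according to their own marginal frequencies.
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Two raters coding ten faults into hypothetical categories
a = ["tool", "schema", "tool", "state", "schema",
     "tool", "state", "tool", "schema", "tool"]
b = ["tool", "schema", "state", "state", "schema",
     "tool", "state", "tool", "tool", "tool"]
print(round(cohens_kappa(a, b), 3))  # → 0.683
```

By McHugh's commonly cited bands (reference [27] above), values in the 0.6–0.8 range are usually read as moderate-to-substantial agreement.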
-
Referee: [Results] Results section (Apriori mining): The association rules linking faults, symptoms, and root causes are presented without the support, confidence, or lift thresholds used, or any statistical significance testing. This makes it impossible to evaluate whether the reported recurring cross-component propagations are robust or artifacts of the sample.
Authors: We will revise the Results section to state the exact Apriori parameters (minimum support, confidence, and lift thresholds) that were applied and to report the statistical significance testing (including the test used and p-value threshold) performed on the discovered rules. The revised text will also include a brief justification for the chosen thresholds so that readers can evaluate the robustness of the reported cross-component propagation patterns. revision: yes
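The rule metrics at issue (support, confidence, lift) are easy to make concrete. A toy computation over invented fault/symptom/cause transactions; the tag names and values are illustrative only, not the paper's data or thresholds:

```python
# Each sampled fault becomes a transaction: a set of tags for its
# fault type, symptom, and root cause. Tags here are hypothetical.
transactions = [
    {"fault:schema_mismatch", "symptom:parse_error", "cause:schema_drift"},
    {"fault:schema_mismatch", "symptom:parse_error", "cause:model_update"},
    {"fault:dependency", "symptom:crash", "cause:dependency_drift"},
    {"fault:schema_mismatch", "symptom:tool_call_failure", "cause:schema_drift"},
    {"fault:state", "symptom:crash", "cause:state_complexity"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    s_ab = support(antecedent | consequent)
    conf = s_ab / support(antecedent)
    lift = conf / support(consequent)
    return s_ab, conf, lift

s, c, l = rule_metrics({"fault:schema_mismatch"}, {"symptom:parse_error"})
# A rule survives only if it clears the chosen minimum thresholds,
# e.g. support >= 0.1, confidence >= 0.5, lift > 1.
```

In practice one would use the MLxtend `apriori` and `association_rules` functions the paper already cites (reference [36]) rather than hand-rolling this; the point is only that the thresholds must be stated for the surviving rules to be interpretable.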
- Remaining out of scope: direct external benchmarking of the sample against closed-source production systems or proprietary logs, which is not feasible for an open-source GitHub-based study.
Circularity Check
No circularity: empirical taxonomy derived from independent data sources
full rationale
The paper performs an empirical study collecting 13,602 issues/PRs from 40 public GitHub repositories, applies stratified sampling to select 385 faults, uses grounded theory to derive the 34-type taxonomy across four dimensions, applies Apriori association mining, and validates via a separate survey of 145 practitioners. No equations, fitted parameters, predictions, or self-citations of prior uniqueness theorems appear in the derivation chain. The taxonomy is constructed directly from the sampled data and external practitioner feedback rather than reducing to quantities defined by the authors' own prior work or by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Grounded theory applied to GitHub issues and pull requests produces a representative taxonomy of faults in agentic AI systems.
Forward citations
Cited by 1 Pith paper
-
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.
Reference graph
Works this paper leans on
-
[1]
2017. ISO/IEC/IEEE International Standard - Systems and software engineering – Vocabulary. ISO/IEC/IEEE 24765:2017(E) (2017), 1–541. doi:10.1109/IEEESTD.2017.8016712
-
[2]
Mouna Abidi, Md Saidur Rahman, Moses Openja, and Foutse Khomh. 2021. Are multi-language design smells fault-prone? An empirical study. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 3 (2021), 1–56
work page 2021
-
[3]
Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB ’94). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 487–499
work page 1994
-
[4]
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016)
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[5]
BabyAGI. 2024. An experimental framework for a self-building autonomous agent. https://babyagi.org/. Accessed: 2025-11-28
work page 2024
-
[6]
Yoshua Bengio, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Ben Bucknall, Malcolm Murray, Rishi Bommasani, Stephen Casper, Tom Davidson, Raymond Douglas, David Duvenaud, Philip Fox, Usman Gohar, Rose Hadshar, Anson Ho, Tiancheng Hu, Cameron Jones, Sayash Kapoor, Atoosa Kasirzadeh, Sam Manning, Nestor Maslej, Vasilios Mavroudis, Conor McGlynn, Rich...
-
[7]
Harry N Boone Jr and Deborah A Boone. 2012. Analyzing Likert data. The Journal of Extension 50, 2 (2012), 48
work page 2012
-
[8]
Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. 2025. Why do multi-agent LLM systems fail? arXiv preprint arXiv:2503.13657 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Lee J Cronbach. 1951. Coefficient alpha and the internal structure of tests. Psychometrika 16, 3 (1951), 297–334
work page 1951
-
[10]
Philip B. Crosby and Capers Jones. 2022. The Cost of Poor Software Quality in the US: A 2022 Report. https://www.it-cisq.org/the-cost-of-poor-software-quality-in-the-us/ Estimates U.S. losses from poor software quality at $2.41 trillion annually
work page 2022
-
[11]
Elena Dasseni, Vassilios S Verykios, Ahmed K Elmagarmid, and Elisa Bertino. 2001. Hiding association rules by using confidence and support. In International Workshop on Information Hiding. Springer, 369–383
work page 2001
-
[12]
Jessica Díaz, Jorge Pérez, Carolina Gallardo, and Ángel González-Prieto. 2023. Applying inter-rater reliability and agreement in collaborative grounded theory studies in software engineering. Journal of Systems and Software 195 (2023), 111520
work page 2023
-
[13]
Barney Glaser and Anselm Strauss. 2017. Discovery of Grounded Theory: Strategies for Qualitative Research. Routledge
work page 2017
-
[14]
Maggie Hamill and Katerina Goseva-Popstojanova. 2009. Common trends in software fault and failure data. IEEE Transactions on Software Engineering 35, 4 (2009), 484–496
work page 2009
-
[15]
Jiawei Han, Micheline Kamber, and Jian Pei. 2011. Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann / Elsevier, Burlington, MA. https://ia800603.us.archive.org/2/items/datamining_201811/DS-book%20u5.pdf
work page 2011
-
[16]
Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, and Ahmed E Hassan. 2025. An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications. arXiv preprint arXiv:2509.19185 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [17]
-
[18]
Gaole He, Gianluca Demartini, and Ujwal Gadiraju. 2025. Plan-then-execute: An empirical study of user trust and team performance when using LLM agents as a daily assistant. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–22
work page 2025
-
[19]
Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M German, and Daniela Damian. 2014. The promises and perils of mining GitHub. In Proceedings of the 11th Working Conference on Mining Software Repositories. 92–101
work page 2014
-
[20]
Misoo Kim, Youngkyoung Kim, and Eunseok Lee. 2021. Denchmark: A bug benchmark of deep learning-related software. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR). IEEE, 540–544
work page 2021
-
[21]
LangChain. 2024. The platform for reliable agents. https://github.com/langchain-ai/langchain. Accessed: 2025-11-28
work page 2024
- [22]
- [23]
-
[24]
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2023. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [25]
- [26]
-
[27]
Mary L McHugh. 2012. Interrater reliability: the kappa statistic. Biochemia Medica 22, 3 (2012), 276–282
work page 2012
-
[28]
Microsoft. 2024. A programming framework for agentic AI. https://github.com/microsoft/autogen. Accessed: 2025-11-28
work page 2024
-
[29]
Mohammad Mehdi Morovati, Amin Nikanjam, Foutse Khomh, and Zhen Ming Jiang. 2023. Bugs in machine learning-based systems: a faultload benchmark. Empirical Software Engineering 28, 3 (2023), 62
work page 2023
-
[30]
Mohammad Mehdi Morovati, Amin Nikanjam, Florian Tambon, Foutse Khomh, and Zhen Ming Jiang. 2024. Bug characterization in machine learning-based systems. Empirical Software Engineering 29, 1 (2024), 14
work page 2024
-
[31]
Nuthan Munaiah, Steven Kroh, Craig Cabrey, and Meiyappan Nagappan. 2017. Curating GitHub for engineered software projects. Empirical Software Engineering 22, 6 (2017), 3219–3253
work page 2017
-
[32]
Geoff Norman. 2010. Likert scales, levels of measurement and the “laws” of statistics. Advances in Health Sciences Education 15, 5 (2010), 625–632
work page 2010
-
[33]
OpenAI. 2025. Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1/. Accessed: 2025-11-28
work page 2025
-
[34]
OWASP Foundation. 2025. OWASP Top 10 for Large Language Model Applications. https://owasp.org/www-project-top-10-for-large-language-model-applications/
work page 2025
-
[35]
Alfin Wijaya Rahardja, Junwei Liu, Weitong Chen, Zhenpeng Chen, and Yiling Lou. 2025. Can Agents Fix Agent Issues? arXiv preprint arXiv:2505.20749 (2025)
-
[36]
Sebastian Raschka. 2018. MLxtend: Providing machine learning and data science utilities and extensions. https://rasbt.github.io/mlxtend/. Version 0.14.0
work page 2018
-
[37]
Per Runeson, Martin Host, Austen Rainer, and Bjorn Regnell. 2012. Case Study Research in Software Engineering: Guidelines and Examples. John Wiley & Sons
work page 2012
-
[38]
Mehil Shah. 2026. Replication Package. https://github.com/mehilshah/Faults-in-Agentic-AI-Replication-Package. Replication package
work page 2026
-
[39]
Mehil B Shah, Mohammad Masudur Rahman, and Foutse Khomh. 2025. Towards enhancing the reproducibility of deep learning bugs: an empirical study. Empirical Software Engineering 30, 1 (2025), 23
work page 2025
-
[40]
Gregory Tassey. 2002. The Economic Impacts of Inadequate Infrastructure for Software Testing. Classic estimate of $59.5 billion annual cost of software bugs in the U.S.
work page 2002
-
[41]
Mohsen Tavakol and Reg Dennick. 2011. Making sense of Cronbach’s alpha. International Journal of Medical Education 2 (2011), 53
work page 2011
-
[42]
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science 18, 6 (2024), 186345
work page 2024
-
[43]
Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, Anders Wesslén, et al. 2012. Experimentation in Software Engineering. Vol. 236. Springer
work page 2012
-
[44]
Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. 2025. The rise and potential of large language model based agents: A survey. Science China Information Sciences 68, 2 (2025), 121101
work page 2025
-
[45]
Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, et al. 2025. Which agent causes task failures and when? On automated failure attribution of LLM multi-agent systems. arXiv preprint arXiv:2505.00212 (2025)
-
[46]
Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, and Zibin Zheng. 2025. LLM hallucinations in practical code generation: Phenomena, mechanism, and mitigation. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 481–503
work page 2025
-
[47]
Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, et al. 2025. Where LLM Agents Fail and How They can Learn From Failures. arXiv preprint arXiv:2509.25370 (2025)
discussion (0)