pith. machine review for the scientific record.

arxiv: 2603.06847 · v2 · submitted 2026-03-06 · 💻 cs.SE

Recognition: no theorem link

Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:39 UTC · model grok-4.3

classification 💻 cs.SE
keywords agentic AI · fault taxonomy · LLM-based systems · root cause analysis · empirical study · grounded theory · software faults

The pith

Faults in agentic AI systems fall into 34 types, organized along four architectural dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts an empirical study to characterize faults in agentic AI systems that combine LLM-based reasoning with tool invocation and external interactions. Researchers collected 13,602 issues and pull requests from 40 repositories, used stratified sampling to select 385 faults, and applied grounded theory to derive taxonomies of fault types, symptoms, and root causes. The resulting classification groups the 34 types into four architectural dimensions and links them to symptoms such as failures in structured-output interpretation and tool calls, with root causes including data schema mismatches and state management complexity. A developer study with 145 practitioners validated the taxonomy as representative while suggesting additions for multi-agent coordination. The work supplies an empirical foundation for diagnosing and mitigating faults to improve system reliability.
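The sample size of 385 is exactly what Cochran's standard formula yields at a 95% confidence level and 5% margin of error; a quick sanity check (the confidence and margin values are assumptions about the authors' setup, not stated in this summary):

```python
import math

def cochran_sample_size(z: float = 1.96, p: float = 0.5, e: float = 0.05) -> int:
    """Cochran's formula for minimum sample size at critical value z,
    assumed proportion p (0.5 maximizes variance), and margin of error e."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

# 95% confidence (z = 1.96), maximum variance (p = 0.5), 5% margin:
print(cochran_sample_size())  # → 385, matching the paper's sample
```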

Core claim

Our analysis produced a taxonomy of 34 fault types, organized into four architectural dimensions. These faults manifested as failures in structured-output interpretation, tool calls, runtime execution, and exception handling, with root causes including data schema mismatches, dependency drift, state management complexity, and model interface instability. Association rules showed recurring cross-component propagation linking structured data, dependency, and state management faults to their symptoms and root causes.
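The association-rule vocabulary used here (support, confidence, lift) can be made concrete with a small sketch of how a rule fault → symptom would be scored over coded fault records; the tags and records below are invented for illustration and are not the paper's actual coding scheme:

```python
def rule_metrics(transactions: list[set[str]], lhs: set[str], rhs: set[str]):
    """Support, confidence, and lift for the association rule lhs -> rhs."""
    n = len(transactions)
    n_rhs = sum(rhs <= t for t in transactions)          # records containing rhs
    n_lhs = sum(lhs <= t for t in transactions)          # records containing lhs
    n_both = sum((lhs | rhs) <= t for t in transactions) # records containing both
    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)  # > 1 means lhs and rhs co-occur more than chance
    return support, confidence, lift

# each record: tags coded for one fault report (invented for illustration)
records = [
    {"fault:schema_mismatch", "symptom:tool_call_failure"},
    {"fault:schema_mismatch", "symptom:tool_call_failure", "cause:dependency_drift"},
    {"fault:state_management", "symptom:runtime_crash"},
    {"fault:schema_mismatch", "symptom:runtime_crash"},
]
s, c, l = rule_metrics(records, {"fault:schema_mismatch"}, {"symptom:tool_call_failure"})
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
# → support=0.50 confidence=0.67 lift=1.33
```

The paper cites MLxtend [36], whose `apriori` and `association_rules` functions compute these same metrics over full itemset lattices; the sketch above only scores one fixed rule.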

What carries the argument

A taxonomy of 34 fault types derived via grounded theory from sampled issues and pull requests, organized into four architectural dimensions of agentic AI systems.

Load-bearing premise

The 385 sampled faults drawn from 40 repositories via stratified sampling are representative of the full population of faults that occur in agentic AI systems in practice.
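Stratified sampling of this kind typically allocates the 385 draws across strata in proportion to stratum size; a minimal sketch with largest-remainder rounding (the stratum names and counts are invented for illustration, the paper's actual strata are not given here):

```python
import math

def proportional_allocation(strata: dict[str, int], n: int) -> dict[str, int]:
    """Allocate n samples across strata in proportion to stratum size,
    using largest-remainder rounding so the totals add up to n exactly."""
    total = sum(strata.values())
    quotas = {k: n * v / total for k, v in strata.items()}
    alloc = {k: math.floor(q) for k, q in quotas.items()}
    # hand out leftover samples to the largest fractional remainders
    leftovers = n - sum(alloc.values())
    for k in sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)[:leftovers]:
        alloc[k] += 1
    return alloc

# hypothetical repository strata summing to the paper's 13,602 items
strata = {"langchain": 6200, "autogen": 4100, "llamaindex": 2300, "other": 1002}
print(proportional_allocation(strata, 385))
# → {'langchain': 176, 'autogen': 116, 'llamaindex': 65, 'other': 28}
```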

What would settle it

Finding a substantial set of faults in deployed agentic AI systems that match none of the 34 types or fall outside the four architectural dimensions would show the taxonomy is incomplete.

Figures

Figures reproduced from arXiv: 2603.06847 by Foutse Khomh, Mehil B Shah, Mohammad Masudur Rahman, Mohammad Mehdi Morovati.

Figure 1. Schematic overview of our empirical study workflow.
Figure 2. Taxonomy of Bugs in Agentic Systems. The numbers in parentheses indicate the frequency of faults.
Original abstract

Agentic AI systems combine LLM-based reasoning, orchestration, tool invocation, and interaction with external environments. These systems introduce faults that are difficult to characterize using existing taxonomies. To address this gap, we present an empirical study of faults in agentic AI systems. We collected 13,602 issues and pull requests from 40 repositories and, using stratified sampling, selected 385 faults for analysis. Through grounded theory, we derived taxonomies of fault types, symptoms, and root causes. We then used Apriori-based association rule mining to identify relationships among faults, symptoms, and root causes, and validated the taxonomy through a developer study with 145 practitioners. Our analysis produced a taxonomy of 34 fault types, organized into four architectural dimensions. These faults manifested as failures in structured-output interpretation, tool calls, runtime execution, and exception handling, with root causes including data schema mismatches, dependency drift, state management complexity, and model interface instability. Furthermore, association rules showed recurring cross-component propagation, linking structured data, dependency, and state management faults to their symptoms and root causes. Practitioners considered the taxonomy representative of agentic AI failures and suggested refinements related to multi-agent coordination and observability. These findings provide an empirical basis for diagnosing faults and improving reliability in agentic AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports an empirical study collecting 13,602 issues and pull requests from 40 GitHub repositories on agentic AI systems. Using stratified sampling, the authors analyze 385 faults via grounded theory to derive taxonomies of 34 fault types, symptoms, and root causes organized into four architectural dimensions. They apply Apriori association rule mining to identify relationships and validate the taxonomy with a survey of 145 practitioners. The central claim is that this taxonomy characterizes faults in structured-output interpretation, tool calls, runtime execution, and exception handling, with root causes such as data schema mismatches, dependency drift, state management complexity, and model interface instability.

Significance. If the derived taxonomy generalizes, it would offer a useful empirical foundation for diagnosing and mitigating faults specific to agentic AI architectures that existing software engineering taxonomies do not adequately cover. The integration of grounded theory with Apriori mining and practitioner validation adds practical value, and the explicit identification of cross-component propagation patterns is a constructive contribution to reliability engineering for LLM-orchestrated systems.

major comments (3)
  1. [Methods] Methods section (data collection and sampling): The generality of the four architectural dimensions and 34 fault types rests on the claim that the 40 repositories and stratified sample of 385 faults are representative of agentic AI systems in practice. No details are given on repository selection criteria (e.g., stars, activity thresholds, framework diversity), stratification variables, or any external benchmark against production logs or closed-source systems, leaving the sampling coverage unverified.
  2. [Analysis] Analysis section (grounded theory): The derivation of fault types, symptoms, and root causes lacks any reported inter-rater reliability metric (e.g., Cohen's kappa or percentage agreement) or description of how coding disagreements were resolved. Without these, the stability of the 34-type taxonomy cannot be assessed.
  3. [Results] Results section (Apriori mining): The association rules linking faults, symptoms, and root causes are presented without the support, confidence, or lift thresholds used, or any statistical significance testing. This makes it impossible to evaluate whether the reported recurring cross-component propagations are robust or artifacts of the sample.
minor comments (2)
  1. [Abstract] The abstract lists four architectural dimensions but does not name them explicitly; adding the names (e.g., reasoning, orchestration, tool-use, environment interaction) would improve clarity.
  2. [Validation] The practitioner validation is described only at a high level; reporting response rate, demographic breakdown, or specific agreement percentages with the taxonomy would strengthen the validation claim.
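The inter-rater check asked for in major comment 2 is straightforward to compute once two coders' labels for the same faults are available; a minimal Cohen's kappa sketch (the labels below are invented for illustration, not the authors' coding scheme):

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from each coder's marginal label frequencies."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# two coders labelling the same six faults (labels invented for illustration)
a = ["schema", "schema", "tool", "state", "tool", "schema"]
b = ["schema", "tool",   "tool", "state", "tool", "schema"]
print(round(cohens_kappa(a, b), 3))  # → 0.739, "substantial" agreement
```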

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the insightful and constructive comments on our manuscript. We address each major comment point by point below, explaining our position and the changes we will make in the revised version.

Point-by-point responses
  1. Referee: [Methods] Methods section (data collection and sampling): The generality of the four architectural dimensions and 34 fault types rests on the claim that the 40 repositories and stratified sample of 385 faults are representative of agentic AI systems in practice. No details are given on repository selection criteria (e.g., stars, activity thresholds, framework diversity), stratification variables, or any external benchmark against production logs or closed-source systems, leaving the sampling coverage unverified.

    Authors: We agree that explicit details on repository selection and sampling strategy are required to support claims of representativeness. In the revised Methods section we will add a dedicated subsection specifying the selection criteria (minimum 500 GitHub stars, commits within the prior 12 months, and coverage across major frameworks such as LangChain, AutoGen, and LlamaIndex), the stratification variables (repository activity tier and primary fault category), and a limitations paragraph acknowledging that direct benchmarking against closed-source production logs was not feasible. These additions will allow readers to evaluate the sampling coverage for themselves. revision: yes

  2. Referee: [Analysis] Analysis section (grounded theory): The derivation of fault types, symptoms, and root causes lacks any reported inter-rater reliability metric (e.g., Cohen's kappa or percentage agreement) or description of how coding disagreements were resolved. Without these, the stability of the 34-type taxonomy cannot be assessed.

    Authors: We will expand the Analysis section to include inter-rater reliability metrics and a description of the disagreement-resolution process. Two authors independently coded an overlapping sample of faults; we will report the resulting agreement statistic and explain that remaining disagreements were resolved through structured discussion meetings until consensus was reached, with a third author available for arbitration. This information will enable readers to assess the stability of the derived taxonomy. revision: yes

  3. Referee: [Results] Results section (Apriori mining): The association rules linking faults, symptoms, and root causes are presented without the support, confidence, or lift thresholds used, or any statistical significance testing. This makes it impossible to evaluate whether the reported recurring cross-component propagations are robust or artifacts of the sample.

    Authors: We will revise the Results section to state the exact Apriori parameters (minimum support, confidence, and lift thresholds) that were applied and to report the statistical significance testing (including the test used and p-value threshold) performed on the discovered rules. The revised text will also include a brief justification for the chosen thresholds so that readers can evaluate the robustness of the reported cross-component propagation patterns. revision: yes

standing simulated objections not resolved
  • Direct external benchmarking of the sample against closed-source production systems or proprietary logs, which is outside the scope of an open-source GitHub-based study.

Circularity Check

0 steps flagged

No circularity: empirical taxonomy derived from independent data sources

Full rationale

The paper performs an empirical study collecting 13,602 issues/PRs from 40 public GitHub repositories, applies stratified sampling to select 385 faults, uses grounded theory to derive the 34-type taxonomy across four dimensions, applies Apriori association mining, and validates via a separate survey of 145 practitioners. No equations, fitted parameters, predictions, or self-citations of prior uniqueness theorems appear in the derivation chain. The taxonomy is constructed directly from the sampled data and external practitioner feedback rather than reducing to quantities defined by the authors' own prior work or by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The taxonomy rests on the domain assumption that grounded theory applied to issue reports yields stable and useful categories for agentic AI faults, plus the assumption that the sampled repositories adequately represent current agentic systems.

axioms (1)
  • domain assumption: Grounded theory applied to GitHub issues and pull requests produces a representative taxonomy of faults in agentic AI systems
    The paper explicitly uses grounded theory on the sampled faults to derive the 34 types.

pith-pipeline@v0.9.0 · 5546 in / 1418 out tokens · 44425 ms · 2026-05-15T14:39:08.629832+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

    cs.LG · 2026-03 · unverdicted · novelty 5.0

    The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    ISO/IEC/IEEE. 2017. ISO/IEC/IEEE International Standard - Systems and software engineering – Vocabulary. ISO/IEC/IEEE 24765:2017(E) (2017), 1–541. doi:10.1109/IEEESTD.2017.8016712

  2. [2]

    Mouna Abidi, Md Saidur Rahman, Moses Openja, and Foutse Khomh. 2021. Are multi-language design smells fault-prone? An empirical study. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 3 (2021), 1–56

  3. [3]

    Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB ’94). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 487–499

  4. [4]

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016)

  5. [5]

    BabyAGI. 2024. An experimental framework for a self-building autonomous agent. https://babyagi.org/. Accessed: 2025-11-28

  6. [6]

    Yoshua Bengio, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Ben Bucknall, Malcolm Murray, Rishi Bommasani, Stephen Casper, Tom Davidson, Raymond Douglas, David Duvenaud, Philip Fox, Usman Gohar, Rose Hadshar, Anson Ho, Tiancheng Hu, Cameron Jones, Sayash Kapoor, Atoosa Kasirzadeh, Sam Manning, Nestor Maslej, Vasilios Mavroudis, Conor McGlynn, Rich...

  7. [7]

    Harry N Boone Jr and Deborah A Boone. 2012. Analyzing likert data. The Journal of Extension 50, 2 (2012), 48

  8. [8]

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. 2025. Why do multi-agent llm systems fail? arXiv preprint arXiv:2503.13657 (2025)

  9. [9]

    Lee J Cronbach. 1951. Coefficient alpha and the internal structure of tests. Psychometrika 16, 3 (1951), 297–334

  10. [10]

    Philip B. Crosby and Capers Jones. 2022. The Cost of Poor Software Quality in the US: A 2022 Report. https://www.it-cisq.org/the-cost-of-poor-software-quality-in-the-us/ Estimates U.S. losses from poor software quality at $2.41 trillion annually

  11. [11]

    Elena Dasseni, Vassilios S Verykios, Ahmed K Elmagarmid, and Elisa Bertino. 2001. Hiding association rules by using confidence and support. In International Workshop on Information Hiding. Springer, 369–383

  12. [12]

    Jessica Díaz, Jorge Pérez, Carolina Gallardo, and Ángel González-Prieto. 2023. Applying inter-rater reliability and agreement in collaborative grounded theory studies in software engineering. Journal of Systems and Software 195 (2023), 111520

  13. [13]

    Barney Glaser and Anselm Strauss. 2017. Discovery of grounded theory: Strategies for qualitative research. Routledge

  14. [14]

    Maggie Hamill and Katerina Goseva-Popstojanova. 2009. Common trends in software fault and failure data. IEEE Transactions on Software Engineering 35, 4 (2009), 484–496

  15. [15]

    Jiawei Han, Micheline Kamber, and Jian Pei. 2011. Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann / Elsevier, Burlington, MA. https://ia800603.us.archive.org/2/items/datamining_201811/DS-book%20u5.pdf

  16. [16]

    Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, and Ahmed E Hassan. 2025. An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications. arXiv preprint arXiv:2509.19185 (2025)

  18. [18]

    Gaole He, Gianluca Demartini, and Ujwal Gadiraju. 2025. Plan-then-execute: An empirical study of user trust and team performance when using llm agents as a daily assistant. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–22

  19. [19]

    Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M German, and Daniela Damian. 2014. The promises and perils of mining github. In Proceedings of the 11th working conference on mining software repositories. 92–101

  20. [20]

    Misoo Kim, Youngkyoung Kim, and Eunseok Lee. 2021. Denchmark: A bug benchmark of deep learning-related software. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR). IEEE, 540–544

  21. [21]

    LangChain. 2024. The platform for reliable agents. https://github.com/langchain-ai/langchain. Accessed: 2025-11-28

  22. [22]

    Hao Li, Haoxiang Zhang, and Ahmed E Hassan. 2025. The rise of ai teammates in software engineering (se) 3.0: How autonomous coding agents are reshaping software engineering. arXiv preprint arXiv:2507.15003 (2025)

  23. [23]

    Xixun Lin, Yucheng Ning, Jingwen Zhang, Yan Dong, Yilong Liu, Yongxuan Wu, Xiaohua Qi, Nan Sun, Yanmin Shang, Kun Wang, et al. 2025. Llm-based agents suffer from hallucinations: A survey of taxonomy, methods, and directions. arXiv preprint arXiv:2509.18970 (2025)

  24. [24]

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2023. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688 (2023)

  25. [25]

    Ruofan Lu, Yichen Li, and Yintong Huo. 2025. Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks. arXiv:2508.13143 [cs.AI] https://arxiv.org/abs/2508.13143

  26. [26]

    Xuyan Ma, Xiaofei Xie, Yawen Wang, Junjie Wang, Boyu Wu, Mingyang Li, and Qing Wang. 2025. Diagnosing Failure Root Causes in Platform-Orchestrated Agentic Systems: Dataset, Taxonomy, and Benchmark. arXiv preprint arXiv:2509.23735 (2025)

  27. [27]

    Mary L McHugh. 2012. Interrater reliability: the kappa statistic. Biochemia Medica 22, 3 (2012), 276–282

  28. [28]

    Microsoft. 2024. A programming framework for agentic AI. https://github.com/microsoft/autogen. Accessed: 2025-11-28

  29. [29]

    Mohammad Mehdi Morovati, Amin Nikanjam, Foutse Khomh, and Zhen Ming Jiang. 2023. Bugs in machine learning-based systems: a faultload benchmark. Empirical Software Engineering 28, 3 (2023), 62

  30. [30]

    Mohammad Mehdi Morovati, Amin Nikanjam, Florian Tambon, Foutse Khomh, and Zhen Ming Jiang. 2024. Bug characterization in machine learning-based systems. Empirical Software Engineering 29, 1 (2024), 14

  31. [31]

    Nuthan Munaiah, Steven Kroh, Craig Cabrey, and Meiyappan Nagappan. 2017. Curating github for engineered software projects. Empirical Software Engineering 22, 6 (2017), 3219–3253

  32. [32]

    Geoff Norman. 2010. Likert scales, levels of measurement and the “laws” of statistics. Advances in Health Sciences Education 15, 5 (2010), 625–632

  33. [33]

    OpenAI. 2025. Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1/. Accessed: 2025-11-28

  34. [34]

    OWASP Foundation. 2025. OWASP Top 10 for Large Language Model Applications. https://owasp.org/www-project-top-10-for-large-language-model-applications/

  35. [35]

    Alfin Wijaya Rahardja, Junwei Liu, Weitong Chen, Zhenpeng Chen, and Yiling Lou. 2025. Can Agents Fix Agent Issues? arXiv preprint arXiv:2505.20749 (2025)

  36. [36]

    Sebastian Raschka. 2018. MLxtend: Providing machine learning and data science utilities and extensions. https://rasbt.github.io/mlxtend/. Version 0.14.0

  37. [37]

    Per Runeson, Martin Host, Austen Rainer, and Bjorn Regnell. 2012. Case study research in software engineering: Guidelines and examples. John Wiley & Sons

  38. [38]

    Mehil Shah. 2026. Replication Package. https://github.com/mehilshah/Faults-in-Agentic-AI-Replication-Package. Replication package

  39. [39]

    Mehil B Shah, Mohammad Masudur Rahman, and Foutse Khomh. 2025. Towards enhancing the reproducibility of deep learning bugs: an empirical study. Empirical Software Engineering 30, 1 (2025), 23

  40. [40]

    Gregory Tassey. 2002. The Economic Impacts of Inadequate Infrastructure for Software Testing. Classic estimate of $59.5 billion annual cost of software bugs in the U.S

  41. [41]

    Mohsen Tavakol and Reg Dennick. 2011. Making sense of Cronbach’s alpha. International Journal of Medical Education 2 (2011), 53

  42. [42]

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science 18, 6 (2024), 186345

  43. [43]

    Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, Anders Wesslén, et al. 2012. Experimentation in software engineering. Vol. 236. Springer

  44. [44]

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. 2025. The rise and potential of large language model based agents: A survey. Science China Information Sciences 68, 2 (2025), 121101

  45. [45]

    Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, et al. 2025. Which agent causes task failures and when? on automated failure attribution of llm multi-agent systems. arXiv preprint arXiv:2505.00212 (2025)

  46. [46]

    Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, and Zibin Zheng. 2025. Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 481–503

  47. [47]

    Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, et al. 2025. Where LLM Agents Fail and How They can Learn From Failures. arXiv preprint arXiv:2509.25370 (2025)