pith. machine review for the scientific record.

arxiv: 2603.06847 · v2 · submitted 2026-03-06 · 💻 cs.SE

Recognition: no theorem link

Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:39 UTC · model grok-4.3

classification 💻 cs.SE
keywords agentic AI · fault taxonomy · LLM-based systems · root cause analysis · empirical study · grounded theory · software faults

The pith

Faults in agentic AI systems fall into 34 types, organized along four architectural dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts an empirical study to characterize faults in agentic AI systems that combine LLM-based reasoning with tool invocation and external interactions. Researchers collected 13,602 issues and pull requests from 40 repositories, used stratified sampling to select 385 faults, and applied grounded theory to derive taxonomies of fault types, symptoms, and root causes. The resulting classification groups the 34 types into four architectural dimensions and links them to symptoms such as failures in structured-output interpretation and tool calls, with root causes including data schema mismatches and state management complexity. A developer study with 145 practitioners validated the taxonomy as representative while suggesting additions for multi-agent coordination. The work supplies an empirical foundation for diagnosing and mitigating faults to improve system reliability.
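The sample size of 385 is exactly what Cochran's standard formula yields at a 95% confidence level and 5% margin of error; a quick sanity check (the confidence and margin values are assumptions about the authors' setup, not stated in this summary):

```python
import math

def cochran_sample_size(z: float = 1.96, p: float = 0.5, e: float = 0.05) -> int:
    """Cochran's formula for minimum sample size at critical value z,
    assumed proportion p (0.5 maximizes variance), and margin of error e."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

# 95% confidence (z = 1.96), maximum variance (p = 0.5), 5% margin:
print(cochran_sample_size())  # → 385, matching the paper's sample
```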

Core claim

Our analysis produced a taxonomy of 34 fault types, organized into four architectural dimensions. These faults manifested as failures in structured-output interpretation, tool calls, runtime execution, and exception handling, with root causes including data schema mismatches, dependency drift, state management complexity, and model interface instability. Association rules showed recurring cross-component propagation linking structured data, dependency, and state management faults to their symptoms and root causes.
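The association-rule vocabulary used here (support, confidence, lift) can be made concrete with a small sketch of how a rule fault → symptom would be scored over coded fault records; the tags and records below are invented for illustration and are not the paper's actual coding scheme:

```python
def rule_metrics(transactions: list[set[str]], lhs: set[str], rhs: set[str]):
    """Support, confidence, and lift for the association rule lhs -> rhs."""
    n = len(transactions)
    n_rhs = sum(rhs <= t for t in transactions)          # records containing rhs
    n_lhs = sum(lhs <= t for t in transactions)          # records containing lhs
    n_both = sum((lhs | rhs) <= t for t in transactions) # records containing both
    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)  # > 1 means lhs and rhs co-occur more than chance
    return support, confidence, lift

# each record: tags coded for one fault report (invented for illustration)
records = [
    {"fault:schema_mismatch", "symptom:tool_call_failure"},
    {"fault:schema_mismatch", "symptom:tool_call_failure", "cause:dependency_drift"},
    {"fault:state_management", "symptom:runtime_crash"},
    {"fault:schema_mismatch", "symptom:runtime_crash"},
]
s, c, l = rule_metrics(records, {"fault:schema_mismatch"}, {"symptom:tool_call_failure"})
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
# → support=0.50 confidence=0.67 lift=1.33
```

The paper cites MLxtend [36], whose `apriori` and `association_rules` functions compute these same metrics over full itemset lattices; the sketch above only scores one fixed rule.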

What carries the argument

A taxonomy of 34 fault types derived via grounded theory from sampled issues and pull requests, organized into four architectural dimensions of agentic AI systems.

Load-bearing premise

The 385 sampled faults drawn from 40 repositories via stratified sampling are representative of the full population of faults that occur in agentic AI systems in practice.
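Stratified sampling of this kind typically allocates the 385 draws across strata in proportion to stratum size; a minimal sketch with largest-remainder rounding (the stratum names and counts are invented for illustration, the paper's actual strata are not given here):

```python
import math

def proportional_allocation(strata: dict[str, int], n: int) -> dict[str, int]:
    """Allocate n samples across strata in proportion to stratum size,
    using largest-remainder rounding so the totals add up to n exactly."""
    total = sum(strata.values())
    quotas = {k: n * v / total for k, v in strata.items()}
    alloc = {k: math.floor(q) for k, q in quotas.items()}
    # hand out leftover samples to the largest fractional remainders
    leftovers = n - sum(alloc.values())
    for k in sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)[:leftovers]:
        alloc[k] += 1
    return alloc

# hypothetical repository strata summing to the paper's 13,602 items
strata = {"langchain": 6200, "autogen": 4100, "llamaindex": 2300, "other": 1002}
print(proportional_allocation(strata, 385))
# → {'langchain': 176, 'autogen': 116, 'llamaindex': 65, 'other': 28}
```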

What would settle it

Finding a substantial set of faults in deployed agentic AI systems that match none of the 34 types or fall outside the four architectural dimensions would show the taxonomy is incomplete.

Figures

Figures reproduced from arXiv: 2603.06847 by Foutse Khomh, Mehil B Shah, Mohammad Masudur Rahman, Mohammad Mehdi Morovati.

Figure 1. Schematic overview of our empirical study workflow.
Figure 2. Taxonomy of Bugs in Agentic Systems. The numbers in parentheses indicate the frequency of faults.
Original abstract

Agentic AI systems combine LLM-based reasoning, orchestration, tool invocation, and interaction with external environments. These systems introduce faults that are difficult to characterize using existing taxonomies. To address this gap, we present an empirical study of faults in agentic AI systems. We collected 13,602 issues and pull requests from 40 repositories and, using stratified sampling, selected 385 faults for analysis. Through grounded theory, we derived taxonomies of fault types, symptoms, and root causes. We then used Apriori-based association rule mining to identify relationships among faults, symptoms, and root causes, and validated the taxonomy through a developer study with 145 practitioners. Our analysis produced a taxonomy of 34 fault types, organized into four architectural dimensions. These faults manifested as failures in structured-output interpretation, tool calls, runtime execution, and exception handling, with root causes including data schema mismatches, dependency drift, state management complexity, and model interface instability. Furthermore, association rules showed recurring cross-component propagation, linking structured data, dependency, and state management faults to their symptoms and root causes. Practitioners considered the taxonomy representative of agentic AI failures and suggested refinements related to multi-agent coordination and observability. These findings provide an empirical basis for diagnosing faults and improving reliability in agentic AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports an empirical study collecting 13,602 issues and pull requests from 40 GitHub repositories on agentic AI systems. Using stratified sampling, the authors analyze 385 faults via grounded theory to derive taxonomies of 34 fault types, symptoms, and root causes organized into four architectural dimensions. They apply Apriori association rule mining to identify relationships and validate the taxonomy with a survey of 145 practitioners. The central claim is that this taxonomy characterizes faults in structured-output interpretation, tool calls, runtime execution, and exception handling, with root causes such as data schema mismatches, dependency drift, state management complexity, and model interface instability.

Significance. If the derived taxonomy generalizes, it would offer a useful empirical foundation for diagnosing and mitigating faults specific to agentic AI architectures that existing software engineering taxonomies do not adequately cover. The integration of grounded theory with Apriori mining and practitioner validation adds practical value, and the explicit identification of cross-component propagation patterns is a constructive contribution to reliability engineering for LLM-orchestrated systems.

major comments (3)
  1. [Methods] Methods section (data collection and sampling): The generality of the four architectural dimensions and 34 fault types rests on the claim that the 40 repositories and stratified sample of 385 faults are representative of agentic AI systems in practice. No details are given on repository selection criteria (e.g., stars, activity thresholds, framework diversity), stratification variables, or any external benchmark against production logs or closed-source systems, leaving the sampling coverage unverified.
  2. [Analysis] Analysis section (grounded theory): The derivation of fault types, symptoms, and root causes lacks any reported inter-rater reliability metric (e.g., Cohen's kappa or percentage agreement) or description of how coding disagreements were resolved. Without these, the stability of the 34-type taxonomy cannot be assessed.
  3. [Results] Results section (Apriori mining): The association rules linking faults, symptoms, and root causes are presented without the support, confidence, or lift thresholds used, or any statistical significance testing. This makes it impossible to evaluate whether the reported recurring cross-component propagations are robust or artifacts of the sample.
minor comments (2)
  1. [Abstract] The abstract lists four architectural dimensions but does not name them explicitly; adding the names (e.g., reasoning, orchestration, tool-use, environment interaction) would improve clarity.
  2. [Validation] The practitioner validation is described only at a high level; reporting response rate, demographic breakdown, or specific agreement percentages with the taxonomy would strengthen the validation claim.
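The inter-rater check asked for in major comment 2 is straightforward to compute once two coders' labels for the same faults are available; a minimal Cohen's kappa sketch (the labels below are invented for illustration, not the authors' coding scheme):

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from each coder's marginal label frequencies."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# two coders labelling the same six faults (labels invented for illustration)
a = ["schema", "schema", "tool", "state", "tool", "schema"]
b = ["schema", "tool",   "tool", "state", "tool", "schema"]
print(round(cohens_kappa(a, b), 3))  # → 0.739, "substantial" agreement
```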

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the insightful and constructive comments on our manuscript. We address each major comment point by point below, explaining our position and the changes we will make in the revised version.

Point-by-point responses
  1. Referee: [Methods] Methods section (data collection and sampling): The generality of the four architectural dimensions and 34 fault types rests on the claim that the 40 repositories and stratified sample of 385 faults are representative of agentic AI systems in practice. No details are given on repository selection criteria (e.g., stars, activity thresholds, framework diversity), stratification variables, or any external benchmark against production logs or closed-source systems, leaving the sampling coverage unverified.

    Authors: We agree that explicit details on repository selection and sampling strategy are required to support claims of representativeness. In the revised Methods section we will add a dedicated subsection specifying the selection criteria (minimum 500 GitHub stars, commits within the prior 12 months, and coverage across major frameworks such as LangChain, AutoGen, and LlamaIndex), the stratification variables (repository activity tier and primary fault category), and a limitations paragraph acknowledging that direct benchmarking against closed-source production logs was not feasible. These additions will allow readers to evaluate the sampling coverage for themselves. revision: yes

  2. Referee: [Analysis] Analysis section (grounded theory): The derivation of fault types, symptoms, and root causes lacks any reported inter-rater reliability metric (e.g., Cohen's kappa or percentage agreement) or description of how coding disagreements were resolved. Without these, the stability of the 34-type taxonomy cannot be assessed.

    Authors: We will expand the Analysis section to include inter-rater reliability metrics and a description of the disagreement-resolution process. Two authors independently coded an overlapping sample of faults; we will report the resulting agreement statistic and explain that remaining disagreements were resolved through structured discussion meetings until consensus was reached, with a third author available for arbitration. This information will enable readers to assess the stability of the derived taxonomy. revision: yes

  3. Referee: [Results] Results section (Apriori mining): The association rules linking faults, symptoms, and root causes are presented without the support, confidence, or lift thresholds used, or any statistical significance testing. This makes it impossible to evaluate whether the reported recurring cross-component propagations are robust or artifacts of the sample.

    Authors: We will revise the Results section to state the exact Apriori parameters (minimum support, confidence, and lift thresholds) that were applied and to report the statistical significance testing (including the test used and p-value threshold) performed on the discovered rules. The revised text will also include a brief justification for the chosen thresholds so that readers can evaluate the robustness of the reported cross-component propagation patterns. revision: yes

standing simulated objections not resolved
  • Direct external benchmarking of the sample against closed-source production systems or proprietary logs, which is outside the scope of an open-source GitHub-based study.

Circularity Check

0 steps flagged

No circularity: empirical taxonomy derived from independent data sources

Full rationale

The paper performs an empirical study collecting 13,602 issues/PRs from 40 public GitHub repositories, applies stratified sampling to select 385 faults, uses grounded theory to derive the 34-type taxonomy across four dimensions, applies Apriori association mining, and validates via a separate survey of 145 practitioners. No equations, fitted parameters, predictions, or self-citations of prior uniqueness theorems appear in the derivation chain. The taxonomy is constructed directly from the sampled data and external practitioner feedback rather than reducing to quantities defined by the authors' own prior work or by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The taxonomy rests on the domain assumption that grounded theory applied to issue reports yields stable and useful categories for agentic AI faults, plus the assumption that the sampled repositories adequately represent current agentic systems.

axioms (1)
  • domain assumption: Grounded theory applied to GitHub issues and pull requests produces a representative taxonomy of faults in agentic AI systems
    The paper explicitly uses grounded theory on the sampled faults to derive the 34 types.

pith-pipeline@v0.9.0 · 5546 in / 1418 out tokens · 44425 ms · 2026-05-15T14:39:08.629832+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

    cs.LG · 2026-03 · unverdicted · novelty 5.0

    The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    ISO/IEC/IEEE. 2017. ISO/IEC/IEEE International Standard - Systems and software engineering – Vocabulary. ISO/IEC/IEEE 24765:2017(E) (2017), 1–541. doi:10.1109/IEEESTD.2017.8016712

  2. [2]

    Mouna Abidi, Md Saidur Rahman, Moses Openja, and Foutse Khomh. 2021. Are multi-language design smells fault-prone? An empirical study. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 3 (2021), 1–56

  3. [3]

    Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB ’94). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 487–499

  4. [4]

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565 (2016)

  5. [5]

    BabyAGI. 2024. An experimental framework for a self-building autonomous agent. https://babyagi.org/. Accessed: 2025-11-28

  6. [6]

    Yoshua Bengio, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Ben Bucknall, Malcolm Murray, Rishi Bommasani, Stephen Casper, Tom Davidson, Raymond Douglas, David Duvenaud, Philip Fox, Usman Gohar, Rose Hadshar, Anson Ho, Tiancheng Hu, Cameron Jones, Sayash Kapoor, Atoosa Kasirzadeh, Sam Manning, Nestor Maslej, Vasilios Mavroudis, Conor McGlynn, Rich...

  7. [7]

    Harry N Boone Jr and Deborah A Boone. 2012. Analyzing likert data. The Journal of Extension 50, 2 (2012), 48

  8. [8]

    Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. 2025. Why do multi-agent llm systems fail? arXiv preprint arXiv:2503.13657 (2025)

  9. [9]

    Lee J Cronbach. 1951. Coefficient alpha and the internal structure of tests. Psychometrika 16, 3 (1951), 297–334

  10. [10]

    Philip B. Crosby and Capers Jones. 2022. The Cost of Poor Software Quality in the US: A 2022 Report. https://www.it-cisq.org/the-cost-of-poor-software-quality-in-the-us/ Estimates U.S. losses from poor software quality at $2.41 trillion annually

  11. [11]

    Elena Dasseni, Vassilios S Verykios, Ahmed K Elmagarmid, and Elisa Bertino. 2001. Hiding association rules by using confidence and support. In International Workshop on Information Hiding. Springer, 369–383

  12. [12]

    Jessica Díaz, Jorge Pérez, Carolina Gallardo, and Ángel González-Prieto. 2023. Applying inter-rater reliability and agreement in collaborative grounded theory studies in software engineering. Journal of Systems and Software 195 (2023), 111520

  13. [13]

    Barney Glaser and Anselm Strauss. 2017. Discovery of grounded theory: Strategies for qualitative research. Routledge

  14. [14]

    Maggie Hamill and Katerina Goseva-Popstojanova. 2009. Common trends in software fault and failure data. IEEE Transactions on Software Engineering 35, 4 (2009), 484–496

  15. [15]

    Jiawei Han, Micheline Kamber, and Jian Pei. 2011. Data Mining: Concepts and Techniques (3rd ed.). Morgan Kaufmann / Elsevier, Burlington, MA. https://ia800603.us.archive.org/2/items/datamining_201811/DS-book%20u5.pdf

  16. [16]

    Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, and Ahmed E Hassan. 2025. An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications. arXiv preprint arXiv:2509.19185 (2025)

  18. [18]

    Gaole He, Gianluca Demartini, and Ujwal Gadiraju. 2025. Plan-then-execute: An empirical study of user trust and team performance when using llm agents as a daily assistant. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–22

  19. [19]

    Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M German, and Daniela Damian. 2014. The promises and perils of mining github. In Proceedings of the 11th working conference on mining software repositories. 92–101

  20. [20]

    Misoo Kim, Youngkyoung Kim, and Eunseok Lee. 2021. Denchmark: A bug benchmark of deep learning-related software. In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR). IEEE, 540–544

  21. [21]

    LangChain. 2024. The platform for reliable agents. https://github.com/langchain-ai/langchain. Accessed: 2025-11-28

  22. [22]

    Hao Li, Haoxiang Zhang, and Ahmed E Hassan. 2025. The rise of ai teammates in software engineering (se) 3.0: How autonomous coding agents are reshaping software engineering. arXiv preprint arXiv:2507.15003 (2025)

  23. [23]

    Xixun Lin, Yucheng Ning, Jingwen Zhang, Yan Dong, Yilong Liu, Yongxuan Wu, Xiaohua Qi, Nan Sun, Yanmin Shang, Kun Wang, et al. 2025. Llm-based agents suffer from hallucinations: A survey of taxonomy, methods, and directions. arXiv preprint arXiv:2509.18970 (2025)

  24. [24]

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. 2023. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688 (2023)

  25. [25]

    Ruofan Lu, Yichen Li, and Yintong Huo. 2025. Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks. arXiv:2508.13143 [cs.AI] https://arxiv.org/abs/2508.13143

  26. [26]

    Xuyan Ma, Xiaofei Xie, Yawen Wang, Junjie Wang, Boyu Wu, Mingyang Li, and Qing Wang. 2025. Diagnosing Failure Root Causes in Platform-Orchestrated Agentic Systems: Dataset, Taxonomy, and Benchmark. arXiv preprint arXiv:2509.23735 (2025)

  27. [27]

    Mary L McHugh. 2012. Interrater reliability: the kappa statistic. Biochemia Medica 22, 3 (2012), 276–282

  28. [28]

    Microsoft. 2024. A programming framework for agentic AI. https://github.com/microsoft/autogen. Accessed: 2025-11-28

  29. [29]

    Mohammad Mehdi Morovati, Amin Nikanjam, Foutse Khomh, and Zhen Ming Jiang. 2023. Bugs in machine learning-based systems: a faultload benchmark. Empirical Software Engineering 28, 3 (2023), 62

  30. [30]

    Mohammad Mehdi Morovati, Amin Nikanjam, Florian Tambon, Foutse Khomh, and Zhen Ming Jiang. 2024. Bug characterization in machine learning-based systems. Empirical Software Engineering 29, 1 (2024), 14

  31. [31]

    Nuthan Munaiah, Steven Kroh, Craig Cabrey, and Meiyappan Nagappan. 2017. Curating github for engineered software projects. Empirical Software Engineering 22, 6 (2017), 3219–3253

  32. [32]

    Geoff Norman. 2010. Likert scales, levels of measurement and the “laws” of statistics. Advances in Health Sciences Education 15, 5 (2010), 625–632

  33. [33]

    OpenAI. 2025. Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1/. Accessed: 2025-11-28

  34. [34]

    OWASP Foundation. 2025. OWASP Top 10 for Large Language Model Applications. https://owasp.org/www-project-top-10-for-large-language-model-applications/

  35. [35]

    Alfin Wijaya Rahardja, Junwei Liu, Weitong Chen, Zhenpeng Chen, and Yiling Lou. 2025. Can Agents Fix Agent Issues? arXiv preprint arXiv:2505.20749 (2025)

  36. [36]

    Sebastian Raschka. 2018. MLxtend: Providing machine learning and data science utilities and extensions. https://rasbt.github.io/mlxtend/. Version 0.14.0

  37. [37]

    Per Runeson, Martin Host, Austen Rainer, and Bjorn Regnell. 2012. Case study research in software engineering: Guidelines and examples. John Wiley & Sons

  38. [38]

    Mehil Shah. 2026. Replication Package. https://github.com/mehilshah/Faults-in-Agentic-AI-Replication-Package. Replication package

  39. [39]

    Mehil B Shah, Mohammad Masudur Rahman, and Foutse Khomh. 2025. Towards enhancing the reproducibility of deep learning bugs: an empirical study. Empirical Software Engineering 30, 1 (2025), 23

  40. [40]

    Gregory Tassey. 2002. The Economic Impacts of Inadequate Infrastructure for Software Testing. Classic estimate of $59.5 billion annual cost of software bugs in the U.S

  41. [41]

    Mohsen Tavakol and Reg Dennick. 2011. Making sense of Cronbach’s alpha. International Journal of Medical Education 2 (2011), 53

  42. [42]

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science 18, 6 (2024), 186345

  43. [43]

    Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, Anders Wesslén, et al. 2012. Experimentation in software engineering. Vol. 236. Springer

  44. [44]

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. 2025. The rise and potential of large language model based agents: A survey. Science China Information Sciences 68, 2 (2025), 121101

  45. [45]

    Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, et al. 2025. Which agent causes task failures and when? on automated failure attribution of llm multi-agent systems. arXiv preprint arXiv:2505.00212 (2025)

  46. [46]

    Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, and Zibin Zheng. 2025. Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 481–503

  47. [47]

    Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, et al. 2025. Where LLM Agents Fail and How They can Learn From Failures. arXiv preprint arXiv:2509.25370 (2025)