An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications
Pith reviewed 2026-05-18 14:08 UTC · model grok-4.3
The pith
Developers of AI agents test deterministic tools and workflows far more than the uncertain prompts and planning that define agent behavior.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By analyzing 39 open-source AI agent frameworks and 439 agentic applications, the paper identifies ten testing patterns and maps them to architectural components, revealing that Resource Artifacts and Coordination Artifacts account for over 70% of testing effort while the FM-based Plan Body receives less than 5% and the Trigger component appears in only around 1% of tests. Novel methods like DeepEval are seldom used while traditional patterns are adapted to manage FM uncertainty, providing the first empirical testing baseline and showing a rational but incomplete adaptation to non-determinism.
What carries the argument
Mapping of observed testing patterns to canonical architectural components of agent frameworks and applications, specifically Resource Artifacts (tools), Coordination Artifacts (workflows), FM-based Plan Body, and Trigger (prompts).
If this is right
- Framework developers should improve support for novel testing methods like DeepEval.
- Application developers must adopt prompt regression testing to cover the neglected Trigger component.
- Researchers should explore barriers to adoption of practices that better address non-determinism.
- Strengthening these practices is vital for building more robust and dependable AI agents.
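The prompt regression testing recommended above can be pictured with a small sketch. It is not taken from the paper or from any studied repository: the template text, `render_prompt`, `fingerprint`, and `check_prompt_unchanged` are all hypothetical names, and the approach shown (comparing a stored digest of the rendered prompt) is only one simple way such a check might be built.

```python
import hashlib

# Hypothetical prompt template, invented for illustration only.
SYSTEM_PROMPT = "You are a support agent. Answer using only the provided context."
SAMPLE_QUESTION = "How do I reset my password?"


def render_prompt(template: str, question: str) -> str:
    """Assemble the final prompt text that would be sent to the model."""
    return f"{template}\n\nQuestion: {question.strip()}"


def fingerprint(text: str) -> str:
    """Return a stable SHA-256 digest that serves as the prompt snapshot."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_prompt_unchanged(baseline: str, template: str = SYSTEM_PROMPT) -> bool:
    """Compare the current rendered prompt against a stored baseline digest."""
    current = fingerprint(render_prompt(template, SAMPLE_QUESTION))
    return current == baseline


# In a real project the baseline digest would be committed alongside the tests.
baseline = fingerprint(render_prompt(SYSTEM_PROMPT, SAMPLE_QUESTION))
print(check_prompt_unchanged(baseline))                                # unchanged prompt
print(check_prompt_unchanged(baseline, SYSTEM_PROMPT + " Be brief."))  # edited prompt
```

A digest check only detects that the prompt text changed; it says nothing about whether model behavior changed, which would need a behavioral check on top.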
Where Pith is reading between the lines
- The same testing imbalances could appear in closed-source or enterprise AI agent projects where data access differs.
- Better prompt testing might reduce unexpected failures when agents are deployed in variable real-world conditions.
- Automated tools for generating regression tests on prompts could help close the observed coverage gap over time.
Load-bearing premise
The 39 frameworks and 439 applications selected from open source repositories are representative of broader developer practices, and the identification of test patterns accurately captures how tests relate to architectural components without significant misclassification.
What would settle it
A replication study using a different selection of frameworks and applications that finds testing effort distributed more evenly or with substantially higher coverage for the Trigger component and Plan Body.
Original abstract
Foundation model (FM)-based AI agents are rapidly gaining adoption across diverse domains, but their inherent non-determinism and non-reproducibility pose testing and quality assurance challenges. While recent benchmarks provide task-level evaluations, there is limited understanding of how developers verify the internal correctness of these agents during development. To address this gap, we conduct the first large-scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open-source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent-specific methods like DeepEval are seldom used (around 1%), while traditional patterns like negative and membership testing are widely adapted to manage FM uncertainty. By mapping these patterns to canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. Crucially, this reveals a critical blind spot, as the Trigger component (prompts) remains neglected, appearing in around 1% of all tests. Our findings offer the first empirical testing baseline in FM-based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non-determinism. To address it, framework developers should improve support for novel testing methods, application developers must adopt prompt regression testing, and researchers should explore barriers to adoption. Strengthening these practices is vital for building more robust and dependable AI agents.
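To make the two traditional patterns named in the abstract concrete, here is a minimal sketch of membership and negative testing as those terms are commonly understood; the paper's exact definitions may differ. `fake_planner` and `ALLOWED_ACTIONS` are hypothetical stand-ins, and no real foundation model is called.

```python
# Hypothetical set of actions an agent is permitted to choose.
ALLOWED_ACTIONS = {"search", "calculate", "finish"}


def fake_planner(task: str) -> dict:
    """Stand-in for an FM-backed planning call; returns one plan step."""
    if not task.strip():
        raise ValueError("task must not be empty")
    return {"action": "search", "argument": task}


def test_membership():
    """Membership testing: accept any output inside an allowed set
    instead of asserting one exact value."""
    step = fake_planner("weather in Paris")
    assert step["action"] in ALLOWED_ACTIONS


def test_negative():
    """Negative testing: invalid input must be rejected with an error."""
    try:
        fake_planner("   ")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an empty task")


test_membership()
test_negative()
```

The membership check tolerates variation in which valid action is chosen, which is why this style of assertion suits outputs that are not reproducible run to run.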
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports the first large-scale empirical study of testing practices in the AI agent ecosystem. The authors analyze 39 open-source agent frameworks and 439 agentic applications, identify ten testing patterns, observe low adoption of novel agent-specific methods such as DeepEval (around 1%), and map observed tests to four canonical architectural components. This mapping yields the central claim of an inversion of testing effort: deterministic components (Resource Artifacts/tools and Coordination Artifacts/workflows) account for over 70% of tests, the FM-based Plan Body receives less than 5%, and the Trigger component (prompts) appears in around 1% of tests. The authors conclude that current practices represent a rational but incomplete adaptation to non-determinism.
Significance. If the mapping procedure is shown to be reliable, the work supplies a much-needed empirical baseline on how developers actually test internal correctness in FM-based agents. The scale of the corpus and the concrete identification of a prompt-testing blind spot are strengths that could usefully inform framework design and future research on agent robustness.
Major comments (1)
- [Mapping procedure (Section 4)] The quantitative claims of the effort inversion (over 70% deterministic, <5% Plan Body, ~1% Trigger) rest entirely on the test-to-component mapping procedure. The manuscript describes the mapping but reports neither inter-rater reliability statistics, a double-coded subsample, nor an error analysis for ambiguous cases such as tests that exercise both a tool and a prompt. Because even moderate misclassification rates would materially affect the reported percentages, this omission is load-bearing for the central result.
Minor comments (2)
- [Abstract] The abstract uses 'around 1%' for both DeepEval adoption and Trigger coverage; a single sentence distinguishing the two figures would improve clarity.
- [Results] A table or figure presenting the per-component test counts would make the 70%/5%/1% inversion easier to verify at a glance.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive evaluation of the manuscript's significance. We address the single major comment below and will revise the paper accordingly to strengthen the reliability of our mapping procedure.
- Referee: [Mapping procedure (Section 4)] The quantitative claims of the effort inversion (over 70% deterministic, <5% Plan Body, ~1% Trigger) rest entirely on the test-to-component mapping procedure. The manuscript describes the mapping but reports neither inter-rater reliability statistics, a double-coded subsample, nor an error analysis for ambiguous cases such as tests that exercise both a tool and a prompt. Because even moderate misclassification rates would materially affect the reported percentages, this omission is load-bearing for the central result.
  Authors: We agree that formal reliability assessment is important to support the central quantitative claims. The mapping was performed by the authors through iterative discussion to resolve disagreements, but the submitted manuscript does not include inter-rater reliability statistics, a double-coded subsample, or a dedicated error analysis. For the revision, we will randomly select a 10% subsample of tests (approximately 50 tests), have two independent coders perform the mapping, compute Cohen's kappa, and add both the reliability results and an error analysis (including discussion of ambiguous cases such as tests involving both tools and prompts) as a new subsection in Section 4. This will directly strengthen the validity of the reported percentages.
  Revision: yes
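For readers unfamiliar with the statistic the authors promise, Cohen's kappa corrects raw agreement between two coders for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it in plain Python; the function name and the example labels are illustrative assumptions, not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both coders used one identical label throughout; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels for four tests, for illustration only.
coder_1 = ["tool", "tool", "prompt", "prompt"]
coder_2 = ["tool", "tool", "prompt", "tool"]
print(cohens_kappa(coder_1, coder_2))
```

In this made-up example the coders agree on three of four items (p_o = 0.75) while chance agreement is p_e = 0.5, so kappa is 0.5.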
Circularity Check
No circularity: empirical counts from external repositories
The paper conducts a direct empirical study by inspecting 39 open-source agent frameworks and 439 agentic applications. Testing patterns are identified from code and test artifacts, then mapped to four architectural components (Resource Artifacts, Coordination Artifacts, Plan Body, Trigger). The reported effort shares (over 70%, less than 5%, around 1%) are simple tallies of observed test instances. No equations, fitted parameters, or self-citations are used to derive these quantities; the mapping procedure is described as manual or semi-automated inspection of external repositories. The analysis is therefore self-contained against external benchmarks and does not reduce any claim to its own inputs by construction.
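The "simple tallies" described above amount to counting component labels and dividing by the total. The sketch below shows that arithmetic under the assumption that each test carries exactly one component label; `component_shares` is a hypothetical helper, and the label list is invented rather than drawn from the paper's data.

```python
from collections import Counter


def component_shares(test_labels):
    """Return each component's share of tests as a percentage of the total."""
    counts = Counter(test_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labeled tests to tally")
    return {component: 100.0 * n / total for component, n in counts.items()}


# Invented labels, one per test, for illustration only.
labels = ["Resource Artifacts"] * 3 + ["Trigger"]
for component, share in component_shares(labels).items():
    print(f"{component}: {share:.1f}%")
```

A test that exercises more than one component, such as a tool together with a prompt, would need an explicit counting rule before it could enter a tally like this, which is the ambiguity the referee raises.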
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The selected open-source AI agent frameworks and applications are representative of typical development and testing practices in the ecosystem.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
By mapping these patterns to canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. Crucially, this reveals a critical blind spot, as the Trigger component (prompts) remains neglected, appearing in around 1% of all tests.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
- Agentic Frameworks for Reasoning Tasks: An Empirical Study
  An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
- Dissecting Bug Triggers and Failure Modes in Modern Agentic Frameworks: An Empirical Study
  Analysis of bugs in modern agentic frameworks uncovers unique symptoms like unexpected execution sequences and root causes including model faults and orchestration issues, with transferable patterns across designs.
- Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes
  An empirical study of real-world issues yields a taxonomy of 34 fault types, symptoms, and root causes in agentic AI systems, validated by 145 practitioners.