An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

Ahmed E. Hassan; Bram Adams; Emad Fallahzadeh; Gopi Krishnan Rajbahadur; Hao Li; Mohammed Mehedi Hasan

arXiv: 2509.19185 · v3 · submitted 2025-09-23 · cs.SE · cs.ET

Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan

Pith reviewed 2026-05-18 14:08 UTC · model grok-4.3

keywords: AI agents, testing practices, empirical study, open source, foundation models, agentic applications, software testing, non-determinism

The pith

Developers of AI agents test deterministic tools and workflows far more than the uncertain prompts and planning that define agent behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper conducts the first large-scale empirical study of testing practices in the AI agent ecosystem by analyzing 39 open-source agent frameworks and 439 agentic applications. It identifies ten distinct testing patterns and maps them to the architectural components of these systems. The study reveals that testing effort is heavily skewed toward deterministic elements such as tools and workflows, which consume over 70 percent of the effort, while the core foundation model planning logic receives less than 5 percent and prompts are tested in only about 1 percent of cases. This inversion highlights a blind spot in handling the non-determinism inherent to AI agents. Readers should care because inadequate testing of uncertain components may lead to unreliable agent behavior in real-world applications.

Core claim

By analyzing 39 open-source AI agent frameworks and 439 agentic applications, the paper identifies ten testing patterns and maps them to architectural components, revealing that Resource Artifacts and Coordination Artifacts account for over 70% of testing effort while the FM-based Plan Body receives less than 5% and the Trigger component appears in only around 1% of tests. Novel methods like DeepEval are seldom used while traditional patterns are adapted to manage FM uncertainty, providing the first empirical testing baseline and showing a rational but incomplete adaptation to non-determinism.

What carries the argument

Mapping of observed testing patterns to canonical architectural components of agent frameworks and applications, specifically Resource Artifacts (tools), Coordination Artifacts (workflows), FM-based Plan Body, and Trigger (prompts).

If this is right

Framework developers should improve support for novel testing methods like DeepEval.
Application developers must adopt prompt regression testing to cover the neglected Trigger component.
Researchers should explore barriers to adoption of practices that better address non-determinism.
Strengthening these practices is vital for building more robust and dependable AI agents.
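
The prompt regression testing recommended above can be sketched as a snapshot comparison. This is a minimal illustration, not the paper's method: `PROMPT_TEMPLATE`, `render_prompt`, and `APPROVED_SNAPSHOT` are hypothetical names, and the check only catches unreviewed changes to the prompt text, not changes in model behavior.

```python
# Hypothetical prompt template; a real project would load its own.
PROMPT_TEMPLATE = "You are a helpful agent. Answer the question: {question}"

# Snapshot approved at the last review (assumed to be stored next to the tests).
APPROVED_SNAPSHOT = "You are a helpful agent. Answer the question: What is 2 + 2?"


def render_prompt(question: str) -> str:
    """Fill the template with the user's question."""
    return PROMPT_TEMPLATE.format(question=question)


def test_prompt_regression():
    # Fails whenever the rendered prompt drifts from the approved snapshot,
    # forcing a deliberate review of any prompt edit.
    assert render_prompt("What is 2 + 2?") == APPROVED_SNAPSHOT


def test_prompt_keeps_required_parts():
    # Looser structural check: the question is embedded and the role
    # preamble is still present.
    rendered = render_prompt("anything")
    assert "anything" in rendered
    assert rendered.startswith("You are")
```

Both functions follow the pytest naming convention, so a test runner would collect them without extra configuration. Updating the snapshot then becomes an explicit, reviewable change.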

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same testing imbalances could appear in closed-source or enterprise AI agent projects where data access differs.
Better prompt testing might reduce unexpected failures when agents are deployed in variable real-world conditions.
Automated tools for generating regression tests on prompts could help close the observed coverage gap over time.

Load-bearing premise

The 39 frameworks and 439 applications selected from open source repositories are representative of broader developer practices, and the identification of test patterns accurately captures how tests relate to architectural components without significant misclassification.

What would settle it

A replication study using a different selection of frameworks and applications that finds testing effort distributed more evenly or with substantially higher coverage for the Trigger component and Plan Body.

Figures

Figures reproduced from arXiv: 2509.19185 by Ahmed E. Hassan, Bram Adams, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Hao Li, Mohammed Mehedi Hasan.

**Figure 1.** Overview of the Research Method.

**Figure 2.** A Sample Test Function marked with Arrange, Act, Assert blocks.

**Figure 3.** Overview of testing patterns observed in agent frameworks and applications.

**Figure 4.** Example DeepEval test case that verifies whether the retrieved output in …

**Figure 5.** Test function where temperature is set as 0 for forcing FMs to always select …

**Figure 6.** Co-occurrence frequency of verification patterns in the same test function in …

**Figure 7.** An overview illustrating the mapping between SUTs and canonical agent …

**Figure 8.** Components tested in Agent Ecosystem.

**Figure 9.** Component-wise Testing Patterns.

Original abstract

Foundation model (FM)-based AI agents are rapidly gaining adoption across diverse domains, but their inherent non-determinism and non-reproducibility pose testing and quality assurance challenges. While recent benchmarks provide task-level evaluations, there is limited understanding of how developers verify the internal correctness of these agents during development. To address this gap, we conduct the first large-scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open-source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent-specific methods like DeepEval are seldom used (around 1%), while traditional patterns like negative and membership testing are widely adapted to manage FM uncertainty. By mapping these patterns to canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. Crucially, this reveals a critical blind spot, as the Trigger component (prompts) remains neglected, appearing in around 1% of all tests. Our findings offer the first empirical testing baseline in FM-based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non-determinism. To address it, framework developers should improve support for novel testing methods, application developers must adopt prompt regression testing, and researchers should explore barriers to adoption. Strengthening these practices is vital for building more robust and dependable AI agents.
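
The membership and negative testing patterns named in the abstract can be illustrated with a small sketch. It assumes the common reading of the two terms (asserting that an output falls in an accepted set, and asserting that invalid input is rejected); the paper's exact definitions are not reproduced here. `fake_plan` and `ALLOWED_TOOLS` are hypothetical stand-ins for an FM-backed planner, not part of any real framework.

```python
import random

# Hypothetical set of tools the planner is allowed to choose from.
ALLOWED_TOOLS = {"search", "calculator", "calendar"}


def fake_plan(prompt: str, seed: int) -> str:
    """Stand-in for a non-deterministic FM planner: picks one tool name."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return random.Random(seed).choice(sorted(ALLOWED_TOOLS))


def test_membership():
    # Membership testing: the exact output varies between runs, so assert
    # that it belongs to an accepted set rather than equals one value.
    for seed in range(20):
        assert fake_plan("What is 2 + 2?", seed) in ALLOWED_TOOLS


def test_negative():
    # Negative testing: invalid input must be rejected explicitly.
    try:
        fake_plan("   ", seed=0)
    except ValueError:
        return
    raise AssertionError("empty prompt was accepted")
```

The membership assertion tolerates run-to-run variation while still bounding the output, which is one plausible reason such patterns adapt well to FM uncertainty.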

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper maps testing practices across 39 AI agent frameworks and 439 applications and finds most effort goes to tools and workflows while prompts and core FM planning get almost none.

This paper's main takeaway is that testing in open source AI agent projects is heavily skewed toward the non-AI components. They looked at 39 frameworks and 439 applications and found that over 70% of tests hit tools and workflows, while the core FM-based planning gets under 5% and prompts around 1%.

What stands out as new is the mapping of ten testing patterns to the agent architecture. They show that traditional methods like negative testing get adapted for uncertainty, but newer tools like DeepEval see almost no use. The inversion they describe is a useful observation for anyone thinking about where the real risks sit in these systems.

The work does a solid job of scaling up the analysis to real repositories and giving concrete counts. It avoids overclaiming by framing the findings as a baseline rather than a complete picture.

The main soft spot is the test-to-component mapping. The stress test note flags the lack of inter-rater reliability or validation steps for assigning tests to Resource Artifacts, Coordination Artifacts, Plan Body, or Trigger. If there's any ambiguity in tests that touch multiple parts, the percentages could be off by enough to change the story on prompt neglect. The abstract doesn't spell out the exact procedure or any error checks, so that part needs more detail to hold up.

This paper is for software engineering researchers focused on AI agents and for framework maintainers who want to see where current practices fall short. A reader interested in empirical data on development habits will find it worth their time.

It deserves a serious referee. The scale and the practical angle make it worth reviewing, even if the methods section needs strengthening. I would recommend sending it to peer review with feedback on clarifying the classification process.

Referee Report

1 major / 2 minor

Summary. The manuscript reports the first large-scale empirical study of testing practices in the AI agent ecosystem. The authors analyze 39 open-source agent frameworks and 439 agentic applications, identify ten testing patterns, observe low adoption of novel agent-specific methods such as DeepEval (around 1%), and map observed tests to four canonical architectural components. This mapping yields the central claim of an inversion of testing effort: deterministic components (Resource Artifacts/tools and Coordination Artifacts/workflows) account for over 70% of tests, the FM-based Plan Body receives less than 5%, and the Trigger component (prompts) appears in around 1% of tests. The authors conclude that current practices represent a rational but incomplete adaptation to non-determinism.

Significance. If the mapping procedure is shown to be reliable, the work supplies a much-needed empirical baseline on how developers actually test internal correctness in FM-based agents. The scale of the corpus and the concrete identification of a prompt-testing blind spot are strengths that could usefully inform framework design and future research on agent robustness.

major comments (1)

[Mapping procedure (Section 4)] The quantitative claims of the effort inversion (over 70% deterministic, <5% Plan Body, ~1% Trigger) rest entirely on the test-to-component mapping procedure. The manuscript describes the mapping but reports neither inter-rater reliability statistics, a double-coded subsample, nor an error analysis for ambiguous cases such as tests that exercise both a tool and a prompt. Because even moderate misclassification rates would materially affect the reported percentages, this omission is load-bearing for the central result.

minor comments (2)

[Abstract] The abstract uses 'around 1%' for both DeepEval adoption and Trigger coverage; a single sentence distinguishing the two figures would improve clarity.
[Results] A table or figure presenting the per-component test counts would make the 70%/5%/1% inversion easier to verify at a glance.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and positive evaluation of the manuscript's significance. We address the single major comment below and will revise the paper accordingly to strengthen the reliability of our mapping procedure.

Referee: [Mapping procedure (Section 4)] The quantitative claims of the effort inversion (over 70% deterministic, <5% Plan Body, ~1% Trigger) rest entirely on the test-to-component mapping procedure. The manuscript describes the mapping but reports neither inter-rater reliability statistics, a double-coded subsample, nor an error analysis for ambiguous cases such as tests that exercise both a tool and a prompt. Because even moderate misclassification rates would materially affect the reported percentages, this omission is load-bearing for the central result.

Authors: We agree that formal reliability assessment is important to support the central quantitative claims. The mapping was performed by the authors through iterative discussion to resolve disagreements, but the submitted manuscript does not include inter-rater reliability statistics, a double-coded subsample, or a dedicated error analysis. For the revision, we will randomly select a 10% subsample of tests (approximately 50 tests), have two independent coders perform the mapping, compute Cohen's kappa, and add both the reliability results and an error analysis (including discussion of ambiguous cases such as tests involving both tools and prompts) as a new subsection in Section 4. This will directly strengthen the validity of the reported percentages.

Revision: yes
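
For readers unfamiliar with the statistic proposed in the rebuttal, Cohen's kappa compares observed agreement p_o between two coders with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a plain-Python illustration; the label lists are invented for the example and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(coder_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: two coders assign four tests to components.
coder_a = ["tool", "tool", "workflow", "prompt"]
coder_b = ["tool", "workflow", "workflow", "prompt"]
kappa = cohens_kappa(coder_a, coder_b)  # 7/11, roughly 0.64
```

In this toy case the coders agree on three of four items (p_o = 0.75) with chance agreement p_e = 5/16, so kappa is 7/11.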

Circularity Check

0 steps flagged

No circularity: empirical counts from external repositories

The paper conducts a direct empirical study by inspecting 39 open-source agent frameworks and 439 agentic applications. Testing patterns are identified from code and test artifacts, then mapped to four architectural components (Resource Artifacts, Coordination Artifacts, Plan Body, Trigger). The reported effort shares (over 70%, less than 5%, around 1%) are simple tallies of observed test instances. No equations, fitted parameters, or self-citations are used to derive these quantities; the mapping procedure is described as manual or semi-automated inspection of external repositories. The analysis is therefore self-contained against external benchmarks and does not reduce any claim to its own inputs by construction.
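
The "simple tallies" described above amount to counting component labels and dividing by the total. A minimal sketch follows, assuming each test carries exactly one component label; the label list is hypothetical and chosen only to make the arithmetic easy to follow, not to reproduce the paper's counts.

```python
from collections import Counter


def component_shares(labels):
    """Return the percentage of tests assigned to each component."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labelled tests to tally")
    return {component: 100.0 * n / total for component, n in counts.items()}


# Hypothetical labels for ten tests (not the paper's data).
labels = (
    ["Resource Artifacts"] * 7
    + ["Coordination Artifacts"] * 2
    + ["Plan Body"] * 1
)
shares = component_shares(labels)
```

If a test that exercises two components were instead counted once per component, the denominator and therefore every share would change, which is why the handling of ambiguous tests matters for the reported percentages.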

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the representativeness of the chosen open-source projects and the accuracy of test pattern identification; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: The selected open-source AI agent frameworks and applications are representative of typical development and testing practices in the ecosystem.
Invoked when generalizing findings from the 39 frameworks and 439 applications to the broader field.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

By mapping these patterns to canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. Crucially, this reveals a critical blind spot, as the Trigger component (prompts) remains neglected, appearing in around 1% of all tests.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agentic Frameworks for Reasoning Tasks: An Empirical Study
cs.AI · 2026-04 · unverdicted · novelty 6.0
An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

Dissecting Bug Triggers and Failure Modes in Modern Agentic Frameworks: An Empirical Study
cs.SE · 2026-04 · unverdicted · novelty 6.0
Analysis of bugs in modern agentic frameworks uncovers unique symptoms like unexpected execution sequences and root causes including model faults and orchestration issues, with transferable patterns across designs.

Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes
cs.SE · 2026-03 · unverdicted · novelty 6.0
An empirical study of real-world issues yields a taxonomy of 34 fault types, symptoms, and root causes in agentic AI systems, validated by 145 practitioners.

[1] [1]

AgentBench: Evaluating LLMs as Agents

X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang et al., “Agentbench: Evaluating llms as agents,”arXiv preprint arXiv:2308.03688,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Generative agents: Interactive simulacra of human behavior,

J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” inProceedings of the 36th annual acm symposium on user interface software and technology, 2023, pp. 1–22. Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liuet al., “Autogen: Enabl...

work page 2023

[3]

WebArena: A Realistic Web Environment for Building Autonomous Agents

S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried et al., “Webarena: A realistic web environment for building autonomous agents,” arXiv preprint arXiv:2307.13854,

[4]

MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback

X. Wang, Z. Wang, J. Liu, Y. Chen, L. Yuan, H. Peng, and H. Ji, “Mint: Evaluating llms in multi-turn interaction with tools and language feedback,” arXiv preprint arXiv:2309.10691, 2023.

[5]

Toward an evaluation science for generative AI systems

L. Weidinger, I. D. Raji, H. Wallach, M. Mitchell, A. Wang, O. Salaudeen, R. Bommasani, D. Ganguli, S. Koyejo, and W. Isaac, “Toward an evaluation science for generative ai systems,” arXiv preprint arXiv:2503.05336,

[6]

Will my tests tell me if i break this code?

R. Niedermayr, E. Juergens, and S. Wagner, “Will my tests tell me if i break this code?” in Proceedings of the International Workshop on Continuous Software Evolution and Delivery, 2016, pp. 23–29.

A. E. Hassan, D. Lin, G. K. Rajbahadur, K. Gallaba, F. R. Cogo, B. Chen, H. Zhang, K. Tha...

[7]

Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers

[Online]. Available: https://arxiv.org/abs/2506.13538

A. Ehtesham, A. Singh, G. K. Gupta, and S. Kumar, “A survey of agent interoperability protocols: Model context protocol (mcp), agent communication protocol (acp), agent-to-agent protocol (a2a), and agent network protocol (anp),” arXiv preprint arXiv:2505.02279,

[8]

A taxonomy for autonomous llm-powered multi-agent architectures

T. Händler, “A taxonomy for autonomous llm-powered multi-agent architectures.” in KMIS, 2023, pp. 85–98.

O. Boissier, R. H. Bordini, J. Hubner, and A. Ricci, Multi-agent oriented programming: programming multi-agent systems using JaCaMo. MIT Press,

[9]

Exploring the composition of unit test suites

B. Van Rompaey and S. Demeyer, “Exploring the composition of unit test suites,” in 2008 23rd IEEE/ACM International Conference on Automated Software Engineering-Workshops. IEEE, 2008, pp. 11–20.

Y. Tao, “An introduction to assertion-based verification,” in 2009 IEEE 8th International Conference on ASIC. IEEE, 2009, pp. 1318–1323. ...

[10]

Balancing autonomy and alignment: a multi-dimensional taxonomy for autonomous llm-powered multi-agent architectures

T. Händler, “Balancing autonomy and alignment: a multi-dimensional taxonomy for autonomous llm-powered multi-agent architectures,” arXiv preprint arXiv:2310.03659,

[11]

The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey

T. Masterman, S. Besen, M. Sawtell, and A. Chao, “The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey,” arXiv preprint arXiv:2404.11584,

[12]

Exploring large language model based intelligent agents: Definitions, methods, and prospects

Y. Cheng, C. Zhang, Z. Zhang, X. Meng, S. Hong, W. Li, Z. Wang, Z. Wang, F. Yin, J. Zhao et al., “Exploring large language model based intelligent agents: Definitions, methods, and prospects,” arXiv preprint arXiv:2401.03428,

[13]

A large-scale study on the usage of testing patterns that address maintainability attributes: patterns for ease of modification, diagnoses, and comprehension

D. Gonzalez, J. C. Santos, A. Popovich, M. Mirakhorli, and M. Nagappan, “A large-scale study on the usage of testing patterns that address maintainability attributes: patterns for ease of modification, diagnoses, and comprehension,” in 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR). IEEE, 2017, pp. 391–401.

C. Wei, L. Xi...

[14]

Carving parameterized unit tests

A. Kampmann and A. Zeller, “Carving parameterized unit tests,” in 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). IEEE, 2019, pp. 248–249.

A. Fontes and G. Gay, “The integration of machine learning into automated test generation: A systematic mapping study,” Software Testing, Verification and Re- ...

[15]

Deepxplore: Automated whitebox testing of deep learning systems

K. Pei, Y. Cao, J. Yang, and S. Jana, “Deepxplore: Automated whitebox testing of deep learning systems,” in Proceedings of the 26th Symposium on Operating Systems Principles, 2017, pp. 1–18.

Y. Nishi, S. Masuda, H. Ogawa, and K. Uetsuki, “A test architecture for machine learning product,”...

[16]

Large Language Model based Multi-Agents: A Survey of Progress and Challenges

T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,” arXiv preprint arXiv:2402.01680,

[17]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Q. Wu, G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. Zhu, B. Li, L. Jiang, X. Zhang, and C. Wang, “Autogen: Enabling next-gen llm applications via multi-agent conversation framework,” arXiv preprint arXiv:2308.08155,

[18]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou et al., “Metagpt: Meta programming for multi-agent collaborative framework,” arXiv preprint arXiv:2308.00352, vol. 3, no. 4, p. 6,

[19]

The state of the ml-universe: 10 years of artificial intelligence & machine learning software development on github

D. Gonzalez, T. Zimmermann, and N. Nagappan, “The state of the ml-universe: 10 years of artificial intelligence & machine learning software development on github,” in Proceedings of the 17th International conference on mining software repositories, 2020, pp. 431–442.

H. Li and C.-P. Bezemer, “Bridging the language gap: an empirical study of bindings for op...

[20]

The promises and perils of mining github

E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian, “The promises and perils of mining github,” in Proceedings of the 11th working conference on mining software repositories, 2014, pp. 92–101.

N. Munaiah, S. Kroh, C. Cabrey, and M. Nagappan, “Curating github for engineered software projects,” Empirical Software Engineering, vol....

[21]

IEEE, 2019, pp. 21–26.

B. Okken, Python Testing with pytest. Pragmatic Bookshelf,

[22]

Pytest-smell: a smell detection tool for python unit tests

A. Bodea, “Pytest-smell: a smell detection tool for python unit tests,” in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 793–796.

B. Cui, J. Li, T. Guo, J. Wang, and D. Ma, “Code comparison system based on abstract syntax tree,” in 2010 3rd IEEE International Conference on Broadband ...

[23]

As code testing: Characterizing test quality in open source ansible development

M. M. Hassan and A. Rahman, “As code testing: Characterizing test quality in open source ansible development,” in 2022 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 2022, pp. 208–219.

S. Gueron, S. Johnson, and J. Walker, “Sha-512/256,” in 2011 Eighth International Conference on Information Technology: New Generations. IEEE,...

[24]

The test automation manifesto

G. Meszaros, S. M. Smith, and J. Andrea, “The test automation manifesto,” in Conference on extreme programming and agile methods. Springer, 2003, pp. 73–

[25]

Assertions are strongly correlated with test suite effectiveness

Y. Zhang and A. Mesbah, “Assertions are strongly correlated with test suite effectiveness,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 214–224.

M. J. Parker, C. Anderson, C. Stone, and Y. Oh, “A large language model approach to educational survey feedback analysis,” International journal of artificial intelligenc...

[26]

Prompt baking

A. Bhargava, C. Witkowski, A. Detkov, and M. Thomson, “Prompt baking,” arXiv preprint arXiv:2409.13697,

[27]

An empirical study on the use of snapshot testing

S. Fujita, Y. Kashiwa, B. Lin, and H. Iida, “An empirical study on the use of snapshot testing,” in 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2023, pp. 335–340.

W. Lam, S. Srisakaokul, B. Bassett, P. Mahdian, T. Xie, P. Lakshman, and J. De Halleux, “A characteristic study of parameterized unit tests in .net open source project...

[28]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “G-eval: Nlg evaluation using gpt-4 with better human alignment,” arXiv preprint arXiv:2303.16634,

[29]

Ragas: Automated evaluation of retrieval augmented generation

S. Es, J. James, L. E. Anke, and S. Schockaert, “Ragas: Automated evaluation of retrieval augmented generation,” in Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 2024, pp. 150–158.

L. Zheng, W.-L. Ch...

[30]

A systematic evaluation of large language models of code

F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn, “A systematic evaluation of large language models of code,” in Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, 2022, pp. 1–10.

N. Tillmann and W. Schulte, “Parameterized unit tests,” ACM SIGSOFT Software Engineering Notes, vol. 30, no. 5, pp. 253–262,

[31]

A Survey on the Memory Mechanism of Large Language Model based Agents

Z. Zhang, X. Bo, C. Ma, R. Li, X. Chen, Q. Dai, J. Zhu, Z. Dong, and J.-R. Wen, “A survey on the memory mechanism of large language model based agents,” arXiv preprint arXiv:2404.13501,

[32]

Agentic plan caching: Test-time memory for fast and cost-efficient LLM agents. arXiv preprint arXiv:2506.14852

F. Bang, “Gptcache: An open-source semantic cache for llm applications enabling faster answers and cost savings,” in Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), 2023, pp. 212–218.

Q. Zhang, M. Wornow, and K. Olukotun, “Cost-efficient serving of llm agents via test-time plan caching,” arXiv preprint ar...

[33]

arXiv:2404.08335

G. Marvin, N. Hellen, D. Jjingo, and J. Nakatumba-Nabende, “Prompt engineering in large language models,” in International conference on data intelligence and cognitive informatics. Springer, 2023, pp. 387–402.

R. C. Barron, V. Grantcharov, S. Wanna, M. E. Eren, M. Bhattarai, N. Solovyev, G. Tompkins, C. Nicholas, K. Ø. Rasmussen, C. Matuszek et al., “Domai...

[34]

Methodology for quality assurance testing of llm-based multi-agent systems

I. Shamim and R. Singhal, “Methodology for quality assurance testing of llm-based multi-agent systems,” in Proceedings of the 4th International Conference on AI-ML Systems, 2024, pp. 1–5.

A Example of Testing Patterns

A.1 Structural Patterns

A.1.1 Hyperparameter Control

Listing 1: Hyperparameter Control: On line 6 hyperpara...