Assistance to Autonomy: A Systematic Literature Review of Agentic AI across the Software Development Life Cycle
Pith reviewed 2026-05-19 16:22 UTC · model grok-4.3
The pith
Output verifiability enables agentic AI adoption mainly in later software development phases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Output verifiability is the primary enabler of agentic adoption: later SDLC phases, whose outputs are objectively evaluable through executable feedback, demonstrate the highest maturity and industrial presence, while earlier phases remain almost exclusively academic proofs-of-concept. The Planner-Executor-Reviewer role specialization is the dominant architectural pattern, with the Reviewer agent implementing verifiability through executable feedback loops. Across all challenge categories, industrial mitigation strategies converge on confining agent actions to verifiable, bounded spaces.
What carries the argument
The Planner-Executor-Reviewer role specialization, which divides responsibilities among agents so that a reviewer can apply executable feedback to confirm task completion.
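As a rough illustration of the pattern described above, the following minimal sketch wires a planner, an executor, and a reviewer into a bounded loop in which the reviewer's verdict comes from actually running the candidate. All function names, the toy task, and the retry limit are assumptions made for illustration; none of them come from the reviewed paper or from any specific framework.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Review:
    passed: bool
    feedback: str


def plan(task: str) -> List[str]:
    # Hypothetical planner: break the task into ordered steps.
    return [f"implement: {task}", f"verify: {task}"]


def execute(steps: List[str], feedback: Optional[str]) -> Callable[[int, int], int]:
    # Hypothetical executor: the first attempt contains a deliberate bug,
    # and any reviewer feedback triggers the repaired version.
    if feedback is None:
        return lambda a, b: a - b
    return lambda a, b: a + b


def review(candidate: Callable[[int, int], int]) -> Review:
    # Reviewer: verdict is based on executing the candidate, not on reading it.
    cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
    for args, expected in cases:
        got = candidate(*args)
        if got != expected:
            return Review(False, f"add{args} returned {got}, expected {expected}")
    return Review(True, "all checks passed")


def run(task: str, max_rounds: int = 3) -> Tuple[int, Review]:
    # Bounded loop: the agent may retry, but only max_rounds times.
    steps = plan(task)
    feedback: Optional[str] = None
    verdict = Review(False, "not run")
    for round_no in range(1, max_rounds + 1):
        candidate = execute(steps, feedback)
        verdict = review(candidate)
        if verdict.passed:
            return round_no, verdict
        feedback = verdict.feedback
    return max_rounds, verdict
```

Calling `run("add two integers")` fails the first review, feeds the failure message back to the executor, and passes on the second round. The `max_rounds` cap is one simple way to keep the agent inside a bounded action space.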
Load-bearing premise
The 92 manually verified primary studies, selected after multi-agent screening of over 1600 candidates, form a representative sample that accurately captures dominant patterns in both academic and industrial agentic AI use.
What would settle it
Discovery of multiple documented industrial deployments of agentic AI in requirements engineering or high-level design that were missed by the review process.
Original abstract
Agentic AI in software product development is increasingly adopted by organizations, yet the field lacks a consolidated synthesis of where adoption is mature, which architectural patterns dominate, and what limitations and coping mechanisms exist in industrial deployments. This systematic literature review addresses these gaps by establishing a body of knowledge as a starting point. Following Kitchenham guidelines, we queried four major research databases, obtaining over 1600 candidate publications. To handle this volume, we developed and validated a domain-agnostic multi-agent screening pipeline that extends prior LLM-assisted review tools by combining automatic metadata curation, inter-agent iterative dialogue, and conflict-resolution defaults that minimize false negatives. From the 92 manually verified primary studies, our thematic synthesis reveals that output verifiability is the primary enabler of agentic adoption: later SDLC phases, whose outputs are objectively evaluable through executable feedback, demonstrate the highest maturity and industrial presence, while earlier phases remain almost exclusively academic proofs-of-concept. We identify the Planner-Executor-Reviewer role specialization as the dominant architectural pattern, with the Reviewer agent implementing verifiability through executable feedback loops. Across all challenge categories, industrial mitigation strategies converge on confining agent actions to verifiable, bounded spaces. This study contributes a comprehensive characterization of the current literature on agentic systems in software product development, and a methodological contribution in the form of an AI-assisted tool to automate the screening phase in high-volume SLR domains.
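The abstract mentions conflict-resolution defaults that minimize false negatives but does not spell out the rule. One plausible recall-oriented default, sketched here purely as an assumption, is to exclude a candidate only when both screening agents independently vote to exclude it, and to pass every other case on to manual verification. The `Vote` type and the function names are hypothetical.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Vote(Enum):
    INCLUDE = "include"
    EXCLUDE = "exclude"
    UNSURE = "unsure"


def resolve(vote_a: Vote, vote_b: Vote) -> Vote:
    # Assumed default: drop a paper only on unanimous exclusion.
    # Any disagreement or uncertainty keeps the paper for human review,
    # trading extra manual work for fewer false negatives.
    if vote_a is Vote.EXCLUDE and vote_b is Vote.EXCLUDE:
        return Vote.EXCLUDE
    return Vote.INCLUDE


def screen(votes: Dict[str, Tuple[Vote, Vote]]) -> List[str]:
    # Return the ids of candidates forwarded to manual verification.
    return [paper_id for paper_id, (a, b) in votes.items()
            if resolve(a, b) is Vote.INCLUDE]
```

Under this rule a split vote such as `(Vote.INCLUDE, Vote.EXCLUDE)` resolves to inclusion, so the cost of a wrong agent judgment is an extra paper for the human reviewers rather than a silently lost study.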
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This systematic literature review follows Kitchenham guidelines to synthesize agentic AI applications across the SDLC. Querying four databases yielded over 1600 candidates; a domain-agnostic multi-agent screening pipeline (with automatic metadata curation, inter-agent dialogue, and conflict resolution) screened these candidates, and 92 primary studies were retained after manual verification. Thematic synthesis concludes that output verifiability is the primary enabler of adoption: later SDLC phases exhibit the highest maturity and industrial presence because their outputs can be checked through executable feedback, while earlier phases remain largely academic proofs-of-concept. The dominant architecture is Planner-Executor-Reviewer specialization, with the Reviewer implementing verifiability via feedback loops. Industrial mitigations converge on confining agents to verifiable, bounded spaces. The work also contributes the AI-assisted screening tool for high-volume SLRs.
Significance. If the synthesis holds, the review consolidates knowledge on adoption patterns, architectural dominance, and practical limitations in agentic AI for software development, offering a useful body of knowledge for researchers and practitioners. Explicit credit is due for the methodological contribution of the validated multi-agent screening pipeline, which extends prior LLM-assisted tools and could improve efficiency in future SLRs. The phase-specific maturity distinction, if robust, provides a falsifiable framing for future empirical work on verifiability as an adoption driver.
major comments (2)
- [Methods] Methods section (screening pipeline description): The abstract and methods claim validation of the multi-agent screening pipeline applied to >1600 candidates, yet no quantitative metrics (precision, recall, F1, or inter-rater agreement) or explicit handling of publication bias are reported. This is load-bearing for the central claim, as the representativeness of the final 92 studies directly supports inferences about phase-specific maturity and industrial presence versus academic proofs-of-concept.
- [Results] Results/Thematic synthesis section: The conclusion that output verifiability drives higher maturity in later phases rests on thematic analysis of publication counts and descriptions across the 92 studies, without sensitivity analysis on screening thresholds or external validation against industry surveys. If the pipeline systematically under-samples non-academic or early-phase work (due to terminology or venue differences), the observed pattern may reflect retrieval bias rather than a genuine enabler effect.
minor comments (2)
- [Abstract] Abstract and introduction: The phrasing 'domain-agnostic multi-agent screening pipeline' is introduced without a forward reference to its detailed specification or limitations in the methods; a brief cross-reference would improve readability.
- [Introduction] The paper cites prior LLM-assisted review tools but does not explicitly contrast its conflict-resolution defaults against those baselines; adding one sentence on the incremental extension would clarify novelty.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential impact of our systematic review and the multi-agent screening pipeline. We provide point-by-point responses to the major comments below, indicating where revisions will be made to address the concerns raised.
-
Referee: [Methods] Methods section (screening pipeline description): The abstract and methods claim validation of the multi-agent screening pipeline applied to >1600 candidates, yet no quantitative metrics (precision, recall, F1, or inter-rater agreement) or explicit handling of publication bias are reported. This is load-bearing for the central claim, as the representativeness of the final 92 studies directly supports inferences about phase-specific maturity and industrial presence versus academic proofs-of-concept.
Authors: We agree with the referee that quantitative metrics are important for validating the screening pipeline and supporting the representativeness of the 92 studies. Although the manuscript emphasizes the pipeline's design features for reducing false negatives and the subsequent manual verification, specific performance metrics were not reported in the initial version. In the revised manuscript, we will add these metrics, including precision, recall, F1-score, and inter-rater agreement (e.g., Cohen's kappa) calculated from a sample of papers screened both by the pipeline and human reviewers. We will also include a discussion of publication bias, detailing our search across multiple databases and efforts to include diverse sources, while acknowledging limitations. revision: yes
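The metrics named in this exchange can all be derived from a single 2x2 table that compares pipeline decisions with human decisions on the same sample. The sketch below shows the standard formulas; the counts in the usage note are invented for illustration and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # pipeline include, human include
    fp: int  # pipeline include, human exclude
    fn: int  # pipeline exclude, human include
    tn: int  # pipeline exclude, human exclude


def precision(c: Confusion) -> float:
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Confusion) -> float:
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f1(c: Confusion) -> float:
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def cohens_kappa(c: Confusion) -> float:
    n = c.tp + c.fp + c.fn + c.tn
    observed = (c.tp + c.tn) / n
    # Chance agreement from the marginal include/exclude rates of each rater.
    expected = ((c.tp + c.fp) * (c.tp + c.fn)
                + (c.fn + c.tn) * (c.fp + c.tn)) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For a hypothetical sample with `Confusion(tp=40, fp=10, fn=5, tn=45)`, precision is 0.8, recall is 40/45 (about 0.889), F1 is 80/95 (about 0.842), and Cohen's kappa is 0.7. For a screening step designed to minimize false negatives, recall is the figure that bears most directly on whether the 92 studies are representative.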
-
Referee: [Results] Results/Thematic synthesis section: The conclusion that output verifiability drives higher maturity in later phases rests on thematic analysis of publication counts and descriptions across the 92 studies, without sensitivity analysis on screening thresholds or external validation against industry surveys. If the pipeline systematically under-samples non-academic or early-phase work (due to terminology or venue differences), the observed pattern may reflect retrieval bias rather than a genuine enabler effect.
Authors: We take this concern seriously, as it questions whether the phase-specific maturity pattern is robust or an artifact of screening bias. Our synthesis is grounded in the detailed thematic analysis of the included studies, which consistently show greater industrial adoption and verifiability in later SDLC phases. To strengthen this, we will perform and report a sensitivity analysis varying the screening thresholds in the multi-agent pipeline and examining the impact on the observed distributions. We will also compare our findings with external industry surveys on AI adoption in software development to provide additional validation. We will expand the discussion of potential biases in the revised Limitations section. revision: yes
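A sensitivity analysis of the kind promised here could take roughly the following shape, assuming the pipeline emits a numeric inclusion score per candidate, which the source does not confirm. The scores, the phase labels, and the grouping of phases into "late" are all hypothetical placeholders.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical (inclusion score, SDLC phase) pairs; not data from the paper.
CANDIDATES: List[Tuple[float, str]] = [
    (0.95, "testing"), (0.90, "maintenance"), (0.85, "implementation"),
    (0.70, "testing"), (0.65, "requirements"), (0.55, "design"),
    (0.40, "requirements"), (0.30, "implementation"),
]

# Assumed grouping of which phases count as "later" phases.
LATE_PHASES = {"implementation", "testing", "maintenance"}


def late_phase_share(candidates: List[Tuple[float, str]],
                     threshold: float) -> Optional[float]:
    # Share of retained studies that fall in later phases at this threshold.
    kept = [phase for score, phase in candidates if score >= threshold]
    if not kept:
        return None
    return sum(1 for phase in kept if phase in LATE_PHASES) / len(kept)


def sweep(candidates: List[Tuple[float, str]],
          thresholds: List[float]) -> Dict[float, Optional[float]]:
    # Recompute the late-phase share for each screening threshold.
    return {t: late_phase_share(candidates, t) for t in thresholds}
```

With this toy data, `sweep(CANDIDATES, [0.3, 0.6, 0.8])` gives late-phase shares of 0.625, 0.8, and 1.0. A share that climbs as the threshold tightens would be a warning sign that the observed late-phase dominance depends partly on where the screening cut-off sits, which is the retrieval-bias concern the referee raises.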
Circularity Check
Minor self-citation in methodological extension; central synthesis remains independent
The paper follows standard Kitchenham SLR guidelines to query four databases, applies a multi-agent screening pipeline described as an extension of prior LLM-assisted review tools, manually verifies 92 primary studies, and performs thematic synthesis to identify patterns such as higher maturity in later SDLC phases due to output verifiability. The derivation chain consists of external evidence aggregation rather than any self-referential reduction; the 92 studies are drawn from the indexed literature and the conclusions are inferences from their reported content, not fitted parameters or definitions that presuppose the result. A single reference to extending prior tools constitutes at most a minor non-load-bearing self-citation, consistent with normal scholarly practice and not forcing the phase-maturity claim.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Kitchenham guidelines provide a valid and complete protocol for conducting systematic literature reviews in software engineering
invented entities (1)
- domain-agnostic multi-agent screening pipeline: no independent evidence
Reference graph
Works this paper leans on
- [1] Acharya, D.B., Kuppan, K., Divya, B.: Agentic AI: Autonomous Intelligence for Complex Goals—A Comprehensive Survey. IEEE Access 13, 18912–18936 (2025), https://ieeexplore.ieee.org/abstract/document/10849561
- [2] Adapa, C., Anjana, A., Rahim, R., Victor, A.: A multi-agent AI framework for agile workflow automation, issue resolution, and developer performance evaluation. In: 2025 IEEE International Conference for Women in Innovation, Technology & Entrepreneurship (ICWITE). pp. 1–6. IEEE (2025)
- [3] Akbar, M.A., Khan, A.A., Hamza, M., et al.: Agentic AI in Software Engineering: Practitioner Perspectives Across the Software Development Life Cycle (Sep 2025), https://papers.ssrn.com/abstract=5520159
- [4] Bandi, A., Kongari, B., Naguru, R., et al.: The Rise of Agentic AI: A Review of Definitions, Frameworks, Architectures, Applications, Evaluation Metrics, and Challenges. Future Internet 17(9) (Sep 2025), https://www.mdpi.com/1999-5903/17/9/404
- [5] Becker, J., Rush, N., Barnes, E., Rein, D.: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity (Jul 2025), http://arxiv.org/abs/2507.09089
- [6] Denyer, D., Tranfield, D., van Aken, J.E.: Developing Design Propositions through Research Synthesis. Organization Studies 29(3), 393–413 (Mar 2008), https://doi.org/10.1177/0170840607088020
- [7] Gough, D., Thomas, J., Oliver, S.: An introduction to systematic reviews. SAGE (2017)
- [8] Gusenbauer, M., Haddaway, N.R.: Which academic search systems are suitable for systematic reviews or meta-analyses? Research Synthesis Methods 11(2), 181–217 (2020), https://onlinelibrary.wiley.com/doi/abs/10.1002/jrsm.1378
- [9]
- [10]
- [11] Hosseini, S., Seilani, H.: The role of agentic AI in shaping a smart future: A systematic review. Array 26, 100399 (Jul 2025), https://www.sciencedirect.com/science/article/pii/S2590005625000268
- [12]
- [13] Ji, X., Zhang, L., Zhang, W., et al.: LEMAD: LLM-Empowered Multi-Agent System for Anomaly Detection in Power Grid Services. Electronics 14(15), 3008 (Jan 2025), https://www.mdpi.com/2079-9292/14/15/3008
- [14]
- [15]
- [16] Kitchenham, B., Charters, S., et al.: Guidelines for performing systematic literature reviews in software engineering. Keele (2007)
- [17]
- [18] Liu, J., Wang, K., Chen, Y., et al.: Large Language Model-Based Agents for Software Engineering: A Survey (Dec 2025), http://arxiv.org/abs/2409.02977
- [19] Mavani, V.K.: Codebase Aware Generative Agents for the SDLC: Automating Documentation, Dependency Analysis and Test Generation. In: 2026 IEEE 5th International Conference on AI in Cybersecurity (ICAIC). pp. 1–4 (Feb 2026), https://ieeexplore.ieee.org/document/11395666
- [20] Murali, V., Maddila, C., Ahmad, I., et al.: AI-Assisted Code Authoring at Scale: Fine-Tuning, Deploying, and Mixed Methods Evaluation. Proc. ACM Softw. Eng. 1(FSE), 48:1066–48:1085 (Jul 2024), https://dl.acm.org/doi/10.1145/3643774
- [21] Otoum, N., Elkhalili, N.: Methods and Techniques of Agentic Software Engineering: A Systematic Literature Review. IEEE Access 14, 7443–7465 (2026), https://ieeexplore.ieee.org/abstract/document/11343819
- [22] Ouzzani, M., Hammady, H., Fedorowicz, Z., Elmagarmid, A.: Rayyan—a web and mobile app for systematic reviews. Systematic Reviews 5(1), 210 (Dec 2016), https://doi.org/10.1186/s13643-016-0384-4
- [23] Peng, S., Kalliamvakou, E., Cihon, P., Demirer, M.: The Impact of AI on Developer Productivity: Evidence from GitHub Copilot (Feb 2023), http://arxiv.org/abs/2302.06590
- [24] Qin, Y., Wang, S., Lou, Y., et al.: SoapFL: A Standard Operating Procedure for LLM-Based Method-Level Fault Localization. IEEE Transactions on Software Engineering 51(4), 1173–1187 (Apr 2025), https://ieeexplore.ieee.org/document/10891926
- [25] Raheem, T., Hossain, G.: Agentic AI Systems: Opportunities, Challenges, and Trustworthiness. In: 2025 IEEE International Conference on Electro Information Technology (eIT). pp. 618–624 (May 2025), https://ieeexplore.ieee.org/abstract/document/11103638
- [26]
- [27] Sapkota, R., Roumeliotis, K.I., Karkee, M.: AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges. Information Fusion 126, 103599 (Feb 2026), http://arxiv.org/abs/2505.10468
- [28]
- [29] Tahat, A., Amundson, I., Hardin, D., Cofer, D.: Agree-dog copilot: a neuro-symbolic approach to enhanced model-based systems engineering. In: International Conference on Bridging the Gap between AI and Reality. pp. 117–137. Springer (2025)
- [30] Wallace, B.C., Small, K., Brodley, C.E., et al.: Deploying an interactive machine learning system in an evidence-based practice center: abstrackr. In: Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium. pp. 819–824. IHI ’12, Association for Computing Machinery, New York, NY, USA (Jan 2012), https://dl.acm.org/doi/10.1145/2110363.2110464
- [31] Wang, L., Ma, C., Feng, X., et al.: A survey on large language model based autonomous agents. Frontiers of Computer Science 18(6), 186345 (Mar 2024), https://doi.org/10.1007/s11704-024-40231-1
- [32]
- [33]