
arxiv: 2605.15245 · v1 · pith:4A66SVTMnew · submitted 2026-05-14 · 💻 cs.SE

Assistance to Autonomy: A Systematic Literature Review of Agentic AI across the Software Development Life Cycle

Pith reviewed 2026-05-19 16:22 UTC · model grok-4.3

classification 💻 cs.SE
keywords: agentic AI · software development life cycle · systematic literature review · output verifiability · AI agents · planner executor reviewer · industrial adoption · SDLC phases

The pith

Output verifiability enables agentic AI adoption mainly in later software development phases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The review synthesizes evidence that agentic AI systems reach industrial maturity where outputs can be checked objectively through execution or tests. Earlier phases such as requirements and design stay mostly academic because their outputs lack this direct feedback. The work shows that teams cope with limitations by restricting agents to bounded, verifiable actions. A reader seeking practical guidance would learn which parts of the development process are ready for agent deployment now. The authors also introduce a multi-agent screening method to handle large numbers of papers efficiently.

Core claim

Output verifiability is the primary enabler of agentic adoption: later SDLC phases, whose outputs are objectively evaluable through executable feedback, demonstrate the highest maturity and industrial presence, while earlier phases remain almost exclusively academic proofs-of-concept. The Planner-Executor-Reviewer role specialization is the dominant architectural pattern, with the Reviewer agent implementing verifiability through executable feedback loops. Across all challenge categories, industrial mitigation strategies converge on confining agent actions to verifiable, bounded spaces.
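As a rough illustration of what "confining agent actions to verifiable, bounded spaces" could look like in practice, the sketch below filters an agent's proposed plan against an allowlist and a workspace prefix. The action names, the `repo/` workspace, and the `Action` type are hypothetical and are not taken from the reviewed paper or from any of its primary studies.

```python
from dataclasses import dataclass

# Hypothetical allowlist: the only actions this illustrative agent may take.
ALLOWED_ACTIONS = {"run_tests", "read_file", "propose_patch"}


@dataclass(frozen=True)
class Action:
    name: str    # what the agent wants to do
    target: str  # path the action operates on


def is_permitted(action: Action, workspace: str = "repo/") -> bool:
    """Permit an action only if it is allowlisted and stays inside the workspace."""
    in_scope = action.target.startswith(workspace) and ".." not in action.target
    return action.name in ALLOWED_ACTIONS and in_scope


def filter_plan(plan: list[Action]) -> tuple[list[Action], list[Action]]:
    """Split a proposed plan into permitted and rejected actions."""
    permitted = [a for a in plan if is_permitted(a)]
    rejected = [a for a in plan if not is_permitted(a)]
    return permitted, rejected
```

Rejected actions would typically be escalated to a human rather than silently dropped; that policy choice is outside the scope of this sketch.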

What carries the argument

The Planner-Executor-Reviewer role specialization, which divides responsibilities among agents so that a reviewer can apply executable feedback to confirm task completion.
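A minimal sketch of the Planner-Executor-Reviewer loop, assuming the Reviewer's "executable feedback" is simply the exit status and error output of running the candidate code against a test snippet. The function names (`review`, `run_task`), the callback signatures, and the retry limit are illustrative assumptions; the paper does not prescribe an implementation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional


def review(code: str, test_code: str) -> tuple[bool, str]:
    """Reviewer: execute candidate code plus tests; the exit code is the verdict."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n" + test_code + "\n")
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True, text=True, timeout=30,
        )
    return result.returncode == 0, result.stderr


def run_task(
    task: str,
    plan: Callable[[str], list[str]],            # Planner: task -> steps
    execute: Callable[[list[str], str], str],    # Executor: (steps, feedback) -> code
    test_code: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Loop Executor and Reviewer until the tests pass or the rounds run out."""
    steps = plan(task)
    feedback = ""
    for _ in range(max_rounds):
        code = execute(steps, feedback)
        passed, feedback = review(code, test_code)
        if passed:
            return code
    return None  # unverified output is never returned as a success
```

In a real system `plan` and `execute` would wrap model calls; here they are plain callables so the control flow can be exercised with stubs.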

Load-bearing premise

The 92 manually verified primary studies, selected after multi-agent screening of over 1600 candidates, form a representative sample that accurately captures dominant patterns in both academic and industrial agentic AI use.

What would settle it

Discovery of multiple documented industrial deployments of agentic AI in requirements engineering or high-level design that were missed by the review process.

Figures

Figures reproduced from arXiv: 2605.15245 by Helena Holmström Olsson, Jan Bosch, Spyridon Alvanakis Apostolou.

Figure 1: Analytical workflow of the multi-agent collaboration and consensus mechanism. minimal reviewer input, raw publication metadata (title, DOI, abstract, etc.) and a single detailed prompt containing the research purpose, research questions, and selection criteria. The pipeline includes self-curation of missing abstracts, inter-agent classification discussions, and produces binary relevant/irrelevant labels wi…
Abstract

Agentic AI in software product development is increasingly adopted by organizations, yet the field lacks a consolidated synthesis of where adoption is mature, which architectural patterns dominate, and what limitations and coping mechanisms exist in industrial deployments. This systematic literature review addresses these gaps by establishing a body of knowledge as a starting point. Following Kitchenham guidelines, we queried four major research databases, obtaining over 1600 candidate publications. To handle this volume, we developed and validated a domain-agnostic multi-agent screening pipeline that extends prior LLM-assisted review tools by combining automatic metadata curation, inter-agent iterative dialogue, and conflict-resolution defaults that minimize false negatives. From the 92 manually verified primary studies, our thematic synthesis reveals that output verifiability is the primary enabler of agentic adoption: later SDLC phases, whose outputs are objectively evaluable through executable feedback, demonstrate the highest maturity and industrial presence, while earlier phases remain almost exclusively academic proofs-of-concept. We identify the Planner-Executor-Reviewer role specialization as the dominant architectural pattern, with the Reviewer agent implementing verifiability through executable feedback loops. Across all challenge categories, industrial mitigation strategies converge on confining agent actions to verifiable, bounded spaces. This study contributes a comprehensive characterization of the current literature on agentic systems in software product development, and a methodological contribution in the form of an AI-assisted tool to automate the screening phase in high-volume SLR domains.
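The abstract describes inter-agent iterative dialogue with conflict-resolution defaults that minimize false negatives, but gives no implementation detail. One plausible reading, sketched below purely as an assumption, is that several screeners vote over a few rounds while seeing each other's previous votes, and any unresolved disagreement defaults to "relevant" so the paper reaches manual verification. The `Screener` signature and `max_rounds` value are hypothetical.

```python
from typing import Callable, Sequence

# A screener receives the abstract and the previous round's votes
# (empty in the first round) and returns True if it judges the paper relevant.
Screener = Callable[[str, Sequence[bool]], bool]


def screen(abstract: str, screeners: Sequence[Screener], max_rounds: int = 3) -> bool:
    """Return a binary relevant/irrelevant label by consensus, defaulting to relevant."""
    votes: list[bool] = []
    for _ in range(max_rounds):
        # Each screener sees the votes cast in the previous round.
        votes = [s(abstract, votes) for s in screeners]
        if len(set(votes)) == 1:
            return votes[0]  # unanimous: accept the shared verdict
    # Assumed default: unresolved conflicts are kept for human review,
    # trading extra manual work for fewer false negatives.
    return True
```

The include-on-conflict default raises the manual workload, which is consistent with the review's reliance on manually verifying the final set of primary studies.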

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This systematic literature review follows Kitchenham guidelines to synthesize agentic AI applications across the SDLC. Querying four databases yielded over 1600 candidates; a domain-agnostic multi-agent screening pipeline (with automatic metadata curation, inter-agent dialogue, and conflict resolution) reduced this to 92 manually verified primary studies. Thematic synthesis concludes that output verifiability is the primary enabler of adoption: later SDLC phases exhibit highest maturity and industrial presence due to executable feedback, while earlier phases remain largely academic proofs-of-concept. The dominant architecture is Planner-Executor-Reviewer specialization, with the Reviewer implementing verifiability via feedback loops. Industrial mitigations converge on confining agents to verifiable bounded spaces. The work also contributes the AI-assisted screening tool for high-volume SLRs.

Significance. If the synthesis holds, the review consolidates knowledge on adoption patterns, architectural dominance, and practical limitations in agentic AI for software development, offering a useful body of knowledge for researchers and practitioners. Explicit credit is due for the methodological contribution of the validated multi-agent screening pipeline, which extends prior LLM-assisted tools and could improve efficiency in future SLRs. The phase-specific maturity distinction, if robust, provides a falsifiable framing for future empirical work on verifiability as an adoption driver.

major comments (2)
  1. [Methods] Methods section (screening pipeline description): The abstract and methods claim validation of the multi-agent screening pipeline applied to >1600 candidates, yet no quantitative metrics (precision, recall, F1, or inter-rater agreement) or explicit handling of publication bias are reported. This is load-bearing for the central claim, as the representativeness of the final 92 studies directly supports inferences about phase-specific maturity and industrial presence versus academic proofs-of-concept.
  2. [Results] Results/Thematic synthesis section: The conclusion that output verifiability drives higher maturity in later phases rests on thematic analysis of publication counts and descriptions across the 92 studies, without sensitivity analysis on screening thresholds or external validation against industry surveys. If the pipeline systematically under-samples non-academic or early-phase work (due to terminology or venue differences), the observed pattern may reflect retrieval bias rather than a genuine enabler effect.
minor comments (2)
  1. [Abstract] Abstract and introduction: The phrasing 'domain-agnostic multi-agent screening pipeline' is introduced without a forward reference to its detailed specification or limitations in the methods; a brief cross-reference would improve readability.
  2. [Introduction] The paper cites prior LLM-assisted review tools but does not explicitly contrast its conflict-resolution defaults against those baselines; adding one sentence on the incremental extension would clarify novelty.
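The metrics requested in major comment 1 are standard and can be computed from a sample labelled by both the pipeline and human reviewers. The sketch below uses plain Python and treats the human labels as ground truth for precision and recall; the function name and the example labels in the usage note are made up for illustration and are not results from the paper.

```python
def screening_metrics(human: list[bool], pipeline: list[bool]) -> dict[str, float]:
    """Precision, recall, F1 and Cohen's kappa for binary screening labels."""
    if len(human) != len(pipeline) or not human:
        raise ValueError("label lists must be non-empty and of equal length")

    tp = sum(h and p for h, p in zip(human, pipeline))
    fp = sum((not h) and p for h, p in zip(human, pipeline))
    fn = sum(h and (not p) for h, p in zip(human, pipeline))
    tn = sum((not h) and (not p) for h, p in zip(human, pipeline))
    n = len(human)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_expected = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else 1.0

    return {"precision": precision, "recall": recall, "f1": f1, "kappa": kappa}
```

For example, with ten made-up papers where humans mark the first three relevant and the pipeline marks the first four relevant, recall is 1.0 and precision is 0.75, which is the profile one would expect from a pipeline tuned to avoid false negatives.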

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential impact of our systematic review and the multi-agent screening pipeline. We provide point-by-point responses to the major comments below, indicating where revisions will be made to address the concerns raised.

Point-by-point responses
  1. Referee: [Methods] Methods section (screening pipeline description): The abstract and methods claim validation of the multi-agent screening pipeline applied to >1600 candidates, yet no quantitative metrics (precision, recall, F1, or inter-rater agreement) or explicit handling of publication bias are reported. This is load-bearing for the central claim, as the representativeness of the final 92 studies directly supports inferences about phase-specific maturity and industrial presence versus academic proofs-of-concept.

    Authors: We agree with the referee that quantitative metrics are important for validating the screening pipeline and supporting the representativeness of the 92 studies. Although the manuscript emphasizes the pipeline's design features for reducing false negatives and the subsequent manual verification, specific performance metrics were not reported in the initial version. In the revised manuscript, we will add these metrics, including precision, recall, F1-score, and inter-rater agreement (e.g., Cohen's kappa) calculated from a sample of papers screened both by the pipeline and human reviewers. We will also include a discussion of publication bias, detailing our search across multiple databases and efforts to include diverse sources, while acknowledging limitations. revision: yes

  2. Referee: [Results] Results/Thematic synthesis section: The conclusion that output verifiability drives higher maturity in later phases rests on thematic analysis of publication counts and descriptions across the 92 studies, without sensitivity analysis on screening thresholds or external validation against industry surveys. If the pipeline systematically under-samples non-academic or early-phase work (due to terminology or venue differences), the observed pattern may reflect retrieval bias rather than a genuine enabler effect.

    Authors: We take this concern seriously, as it questions whether the phase-specific maturity pattern is robust or an artifact of screening bias. Our synthesis is grounded in the detailed thematic analysis of the included studies, which consistently show greater industrial adoption and verifiability in later SDLC phases. To strengthen this, we will perform and report a sensitivity analysis varying the screening thresholds in the multi-agent pipeline and examining the impact on the observed distributions. We will also compare our findings with external industry surveys on AI adoption in software development to provide additional validation. We will expand the discussion of potential biases in the revised Limitations section. revision: yes
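The sensitivity analysis promised in the second response could take roughly the following shape. This is a sketch under strong assumptions: it supposes each candidate carries a numeric inclusion score (for instance, the fraction of agents voting "relevant"), which the source does not confirm, and the records below are invented placeholders rather than data from the review.

```python
from collections import Counter

# Made-up records: (assumed inclusion score in [0, 1], SDLC phase).
CANDIDATES = [
    (0.9, "testing"), (0.8, "testing"), (0.7, "maintenance"),
    (0.6, "requirements"), (0.4, "design"), (0.3, "requirements"),
]


def phase_shares(candidates: list[tuple[float, str]], threshold: float) -> dict[str, float]:
    """Share of each SDLC phase among candidates kept at the given threshold."""
    kept = [phase for score, phase in candidates if score >= threshold]
    if not kept:
        return {}
    counts = Counter(kept)
    return {phase: n / len(kept) for phase, n in counts.items()}


def sensitivity(
    candidates: list[tuple[float, str]], thresholds: list[float]
) -> dict[float, dict[str, float]]:
    """Recompute the phase distribution for every threshold under test."""
    return {t: phase_shares(candidates, t) for t in thresholds}
```

If the share of early-phase studies swings sharply as the threshold moves, the phase-maturity pattern is sensitive to screening strictness, which is the retrieval-bias scenario the referee describes.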

Circularity Check

0 steps flagged

Minor self-citation in methodological extension; central synthesis remains independent

full rationale

The paper follows standard Kitchenham SLR guidelines to query four databases, applies a multi-agent screening pipeline described as an extension of prior LLM-assisted review tools, manually verifies 92 primary studies, and performs thematic synthesis to identify patterns such as higher maturity in later SDLC phases due to output verifiability. The derivation chain consists of external evidence aggregation rather than any self-referential reduction; the 92 studies are drawn from the indexed literature and the conclusions are inferences from their reported content, not fitted parameters or definitions that presuppose the result. A single reference to extending prior tools constitutes at most a minor non-load-bearing self-citation, consistent with normal scholarly practice and not forcing the phase-maturity claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central synthesis rests on the representativeness of the screened literature and the fidelity of the thematic analysis; the methodological contribution introduces a new screening pipeline whose validation details are not supplied in the abstract.

axioms (1)
  • [domain assumption] Kitchenham guidelines provide a valid and complete protocol for conducting systematic literature reviews in software engineering
    Invoked to justify the overall review process and search strategy.
invented entities (1)
  • domain-agnostic multi-agent screening pipeline [no independent evidence]
    purpose: Automate initial screening of large candidate sets while minimizing false negatives through inter-agent dialogue and conflict resolution
    Presented as an extension of prior LLM-assisted tools; no independent evidence of correctness beyond the authors' validation claim is given in the abstract.

pith-pipeline@v0.9.0 · 5793 in / 1287 out tokens · 57186 ms · 2026-05-19T16:22:24.733504+00:00 · methodology

