Adaptive and AI-Augmented Security Testing: A Systematic Survey of Program Analysis, Feedback-Driven Testing, and Hybrid Learning-Based Approaches
Pith reviewed 2026-05-07 13:27 UTC · model grok-4.3
The pith
Security testing research shows a persistent disconnect between structural program analysis and adaptive feedback mechanisms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Analysis of the fifty-five studies reveals a persistent disconnect between structural program representations (ASTs, CFGs, and CPGs) and adaptive testing mechanisms, which the authors term structural-adaptive fragmentation. Neither paradigm alone resolves the separation, and no existing system incorporates human triage signals as feedback for refining structural models. The survey identifies five open research challenges and outlines an agenda for unified, semantically grounded, feedback-driven, polyglot security testing frameworks.
What carries the argument
Structural-adaptive fragmentation: the systematic separation between structural program representations (ASTs, CFGs, CPGs) and adaptive testing mechanisms that neither paradigm individually addresses.
If this is right
- Hybrid systems that combine program analysis with adaptive learning can reduce reliance on non-adaptive workflows in continuous security testing.
- Incorporating execution feedback and human triage into structural models would lower the volume of warnings that require manual triage in CI/CD pipelines.
- A unified framework supporting semantically grounded, feedback-driven testing would enable more effective vulnerability detection across multiple languages.
- Progress on the five identified open challenges would move the field toward integrated rather than fragmented security testing approaches.
Where Pith is reading between the lines
- Closing the identified gap could produce testing tools that automatically adjust structural models based on both runtime signals and human judgments, reducing false positives over time.
- The fragmentation pattern may appear in other software engineering domains where static analysis outputs feed into dynamic or learning-based processes without feedback loops.
- A practical next step would be to prototype a system that routes human triage decisions back into structural representations and measure changes in warning volume or detection accuracy.
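The prototype suggested in the last point can be sketched in a few lines. This is a minimal illustration, not a design taken from the surveyed paper: the `Finding` and `TriageFeedbackModel` names, the use of a node kind as the structural feature, the Laplace-smoothed precision estimate, and the 0.3 threshold are all assumptions chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule: str       # hypothetical analyzer rule id
    node_kind: str  # hypothetical structural feature, e.g. node type at the sink


class TriageFeedbackModel:
    """Tracks human triage verdicts per (rule, node_kind) pattern."""

    def __init__(self, threshold: float = 0.3):
        self.threshold = threshold          # assumed cut-off, not from the paper
        self.true_pos = defaultdict(int)
        self.false_pos = defaultdict(int)

    def record(self, finding: Finding, is_real: bool) -> None:
        """Route one human triage decision back into the model."""
        key = (finding.rule, finding.node_kind)
        if is_real:
            self.true_pos[key] += 1
        else:
            self.false_pos[key] += 1

    def precision(self, finding: Finding) -> float:
        """Laplace-smoothed precision; an unseen pattern starts at 0.5."""
        key = (finding.rule, finding.node_kind)
        tp, fp = self.true_pos[key], self.false_pos[key]
        return (tp + 1) / (tp + fp + 2)

    def filter(self, findings):
        """Keep only findings whose pattern is still considered credible."""
        return [f for f in findings if self.precision(f) >= self.threshold]


findings = [
    Finding("sql-injection", "call"),
    Finding("sql-injection", "literal"),
    Finding("path-traversal", "call"),
]
model = TriageFeedbackModel(threshold=0.3)
for _ in range(4):  # reviewers reject the same pattern four times
    model.record(Finding("sql-injection", "literal"), is_real=False)
model.record(Finding("sql-injection", "call"), is_real=True)

kept = model.filter(findings)
print(len(findings), "->", len(kept))  # warning volume before and after: 3 -> 2
```

Comparing `len(findings)` with `len(kept)` over time is one simple way to measure the change in warning volume; measuring detection accuracy would additionally need ground-truth labels.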
Load-bearing premise
The fifty-five peer-reviewed studies selected from the systematic search of four databases represent the field without major bias in inclusion or categorization.
What would settle it
Identification of even one existing system that incorporates human triage signals as feedback to refine structural program models would falsify the claim that no such system exists.
Original abstract
Modern software systems are increasingly developed within rapid continuous integration and deployment (CI/CD) pipelines, where ensuring security prior to release presents significant technical and organizational challenges. Traditional static and dynamic analysis tools provide valuable structural and behavioral insights, yet they often operate in non-adaptive workflows and produce large volumes of warnings requiring manual triage. Feedback-driven fuzzing and search-based testing approaches have demonstrated the power of iterative input refinement guided by execution signals, while large language models (LLMs) have shown promise in automated test generation but frequently lack semantic grounding in program structure. This paper presents a systematic survey of adaptive and AI-augmented security testing research across five domains: (1) structural program analysis for vulnerability detection, (2) DevSecOps and continuous security testing, (3) feedback-driven fuzzing and search-based testing, (4) LLM-based automated test generation, and (5) emerging hybrid systems integrating program analysis with adaptive learning. We analyze fifty-five peer-reviewed studies drawn from a systematic search of four major databases yielding 22,088 raw records. Our analysis reveals a persistent disconnect between structural program representations (ASTs, CFGs, and CPGs) and adaptive testing mechanisms. We characterize this as structural-adaptive fragmentation: a systematic separation that neither paradigm individually addresses. No existing system incorporates human triage signals as feedback for refining structural models. We conclude by identifying five open research challenges and outlining a unified agenda for semantically grounded, feedback-driven, polyglot security testing frameworks.
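To make "iterative input refinement guided by execution signals" concrete, the following toy loop shows the basic shape of coverage-guided fuzzing. It is only a sketch under stated assumptions: `target`, `mutate`, and `fuzz` are hypothetical names, the program under test is a hand-written stand-in that reports which branches an input reaches, and real fuzzers obtain that signal through instrumentation rather than a return value.

```python
import random


def target(data: bytes) -> set:
    """Toy program under test: returns the ids of the branches reached."""
    covered = {"entry"}
    if len(data) > 0 and data[0] == ord("F"):
        covered.add("b1")
        if len(data) > 1 and data[1] == ord("U"):
            covered.add("b2")
            if len(data) > 2 and data[2] == ord("Z"):
                covered.add("b3")
    return covered


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Overwrite one randomly chosen byte (input must be non-empty)."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(seed_input: bytes, iterations: int, rng: random.Random):
    """Keep a mutated input only if it reaches a branch not seen before."""
    corpus = [seed_input]
    seen = set(target(seed_input))
    for _ in range(iterations):
        candidate = mutate(rng.choice(corpus), rng)
        coverage = target(candidate)
        if not coverage <= seen:  # execution signal: new coverage
            seen |= coverage
            corpus.append(candidate)
    return corpus, seen


corpus, seen = fuzz(b"AAA", iterations=20000, rng=random.Random(0))
print(len(corpus), sorted(seen))
```

With a fixed `random.Random` seed the run is reproducible; how many of the four branches are reached depends on the iteration budget. The point of the sketch is the feedback edge: inputs that expose new behavior are retained and become the basis for further mutation.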
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper presents a systematic survey of adaptive and AI-augmented security testing research across five domains: (1) structural program analysis for vulnerability detection, (2) DevSecOps and continuous security testing, (3) feedback-driven fuzzing and search-based testing, (4) LLM-based automated test generation, and (5) emerging hybrid systems. Drawing on 55 peer-reviewed studies identified from a search of four databases that returned 22,088 raw records, the authors identify a persistent 'structural-adaptive fragmentation' between structural representations (ASTs, CFGs, CPGs) and adaptive testing mechanisms, assert that no existing system incorporates human triage signals as feedback for refining structural models, and outline five open research challenges for unified, semantically grounded frameworks.
Significance. If the survey's selection and synthesis are reproducible and unbiased, the work is significant for mapping fragmentation in the field and highlighting the absence of human-in-the-loop refinement loops. It provides a clear research agenda that could guide development of hybrid tools integrating program structure with feedback-driven and LLM-based methods, addressing real challenges in CI/CD security testing.
major comments (1)
- [Systematic review methodology (search strategy, study selection, and synthesis sections)] The central claims of structural-adaptive fragmentation and that 'no existing system incorporates human triage signals as feedback for refining structural models' rest entirely on the qualitative synthesis of the 55 selected studies. However, the manuscript does not provide the explicit search strings, inclusion/exclusion criteria, data extraction protocol, quality assessment criteria, or inter-rater agreement metrics used to reduce 22,088 records to 55 studies (see the systematic search description). Without these details, it is impossible to assess whether relevant hybrid or feedback-refinement papers were under-retrieved or mis-categorized, making the fragmentation characterization unverifiable.
minor comments (1)
- [Abstract] The abstract lists five domains; this enumeration should be cross-checked against the later analysis sections so that the two align exactly.
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review. The feedback highlights an important opportunity to improve the transparency of our systematic review process, and we will revise the manuscript accordingly to strengthen reproducibility while preserving the core contributions.
Point-by-point responses
Referee: [Systematic review methodology (search strategy, study selection, and synthesis sections)] The central claims of structural-adaptive fragmentation and that 'no existing system incorporates human triage signals as feedback for refining structural models' rest entirely on the qualitative synthesis of the 55 selected studies. However, the manuscript does not provide the explicit search strings, inclusion/exclusion criteria, data extraction protocol, quality assessment criteria, or inter-rater agreement metrics used to reduce 22,088 records to 55 studies (see the systematic search description). Without these details, it is impossible to assess whether relevant hybrid or feedback-refinement papers were under-retrieved or mis-categorized, making the fragmentation characterization unverifiable.
Authors: We agree that the current manuscript provides only a high-level overview of the search process and lacks the granular protocol details needed for full reproducibility. In the revised version we will insert a dedicated 'Systematic Review Methodology' subsection (following PRISMA guidelines) that explicitly lists: (1) the complete search strings used in each of the four databases, (2) the full inclusion/exclusion criteria applied at each screening stage, (3) the data extraction protocol and form, (4) the quality assessment criteria (including scoring rubrics), and (5) inter-rater agreement statistics (Cohen's kappa) for both title/abstract and full-text screening. These details were recorded during the original review and will be reported without changing the set of 55 studies or the resulting synthesis. We believe this addition will allow independent verification of the fragmentation characterization and the claim regarding the absence of human-triage feedback loops.
revision: yes
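For readers unfamiliar with the agreement statistic named in the response, Cohen's kappa compares the observed agreement between two raters, p_o, with the agreement expected by chance, p_e, as kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two screeners' include/exclude decisions. The `cohens_kappa` helper and the ten decisions are made up for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # p_o: fraction of items on which the two raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: chance agreement from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical title/abstract screening decisions for ten records.
screener_1 = ["in", "in", "in", "in", "out", "out", "out", "out", "out", "out"]
screener_2 = ["in", "in", "in", "out", "out", "out", "out", "out", "out", "in"]

kappa = cohens_kappa(screener_1, screener_2)
print(round(kappa, 3))  # 0.583  (p_o = 0.8, p_e = 0.52)
```

Reporting kappa separately for title/abstract and full-text screening, as the authors propose, shows at which stage the two screeners diverged.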
Circularity Check
No significant circularity in survey synthesis
The paper is a systematic literature survey whose central claims (structural-adaptive fragmentation and absence of human-triage feedback loops) are synthesized from qualitative review of 55 externally identified peer-reviewed studies. No mathematical derivations, parameter fittings, self-referential predictions, or load-bearing self-citations appear in the derivation chain. The survey process applies standard database search and inclusion criteria to independent external sources, making the analysis self-contained against literature benchmarks without reduction to internal definitions or author priors.