OpenAnt: LLM-Powered Vulnerability Discovery Through Code Decomposition, Adversarial Verification, and Dynamic Testing
Pith reviewed 2026-06-26 20:19 UTC · model grok-4.3
The pith
OpenAnt decomposes codebases and uses LLM reasoning plus sandbox testing to find unknown vulnerabilities with fewer false positives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenAnt integrates static program analysis with LLM-based reasoning in a multi-stage pipeline. Codebases are decomposed into self-contained analysis units filtered by reachability from external entry points, reducing the analysis surface by up to 97% while preserving attack-relevant code. Candidate vulnerabilities then undergo adversarial verification through constrained attacker simulation. Findings are validated through dynamic verification in which exploit environments are generated automatically, executed in sandboxed containers, and discarded after use. Evaluation on widely used open-source projects shows the architecture identifies previously unknown vulnerabilities while maintaining manageable analysis cost and substantially reducing false positives.
What carries the argument
OpenAnt's three-stage pipeline that decomposes reachable code units, runs adversarial attacker simulation for verification, and performs automated dynamic execution in sandboxes.
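The reachability filter in the first stage can be pictured as a graph traversal. The following is a minimal sketch, not OpenAnt's implementation: it assumes a precomputed call graph keyed by unit name, and every name in it (`reachable_units`, `CALL_GRAPH`, the unit names) is hypothetical.

```python
from collections import deque


def reachable_units(call_graph, entry_points):
    """Return every unit reachable from the entry points via breadth-first search."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        unit = queue.popleft()
        for callee in call_graph.get(unit, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


# Hypothetical call graph (caller -> callees); not taken from any real project.
CALL_GRAPH = {
    "http_handler": ["parse_request", "render"],
    "parse_request": ["decode_header"],
    "render": [],
    "decode_header": [],
    "cron_cleanup": ["delete_tmp"],  # only triggered internally
    "delete_tmp": [],
    "unused_helper": [],
}

retained = reachable_units(CALL_GRAPH, ["http_handler"])
# Fraction of units dropped by the filter (3 of 7 in this toy graph).
reduction = 1 - len(retained) / len(CALL_GRAPH)
```

In this toy graph the filter keeps four of seven units. It also drops `cron_cleanup` and `delete_tmp`, which is the kind of internally triggered code the referee report worries a reachability filter could exclude.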
If this is right
- Previously unknown vulnerabilities can be identified in widely used projects such as OpenSSL, WordPress, and Flowise.
- Analysis surface is reduced by up to 97 percent while attack-relevant code is retained.
- False positives are substantially lower than those produced by traditional static analysis alone.
- Analysis cost remains manageable for repository-scale security work.
- Closed-loop pipelines that combine semantic reasoning with exploit validation offer a route to scalable automated security analysis.
Where Pith is reading between the lines
- The same decomposition and validation loop could be applied to detect configuration or logic errors that are not classic memory-safety bugs.
- Pairing the initial decomposition step with existing static analyzers might further trim the set of units sent to the LLM stage.
- The sandbox generation process suggests a route for embedding continuous vulnerability checks inside developer build pipelines.
Load-bearing premise
Filtering code into self-contained units by reachability from external entry points keeps every attack-relevant path without dropping exploitable code.
What would settle it
Run OpenAnt on a codebase containing several known, documented vulnerabilities and measure whether it reports them while keeping the number of candidates that fail sandbox verification low.
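The two quantities in that test can be written down directly. This is a sketch under assumptions: the function name `settle_metrics` and all identifiers in the sample data are placeholders, and the inputs are taken to be plain sets of finding IDs rather than any format OpenAnt actually emits.

```python
def settle_metrics(known_vulns, reported, sandbox_confirmed):
    """Return (recall on known vulnerabilities, share of candidates failing the sandbox)."""
    known = set(known_vulns)
    reported = set(reported)
    confirmed = set(sandbox_confirmed)
    # Recall: how many documented vulnerabilities the tool reported at all.
    recall = len(known & reported) / len(known) if known else 0.0
    # Failure rate: reported candidates that sandbox verification did not confirm.
    failed = reported - confirmed
    failure_rate = len(failed) / len(reported) if reported else 0.0
    return recall, failure_rate


# Placeholder IDs for illustration only.
known = {"KNOWN-1", "KNOWN-2", "KNOWN-3", "KNOWN-4"}
reported = {"KNOWN-1", "KNOWN-2", "KNOWN-3", "cand-1", "cand-2"}
confirmed = {"KNOWN-1", "KNOWN-2", "KNOWN-3", "cand-1"}

recall, failure_rate = settle_metrics(known, reported, confirmed)
```

With this sample data the recall is 0.75 and one in five candidates fails verification. A result that settles the question would pair high recall with a low failure rate.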
Original abstract
Automated vulnerability discovery in large codebases remains challenging: traditional static analysis produces high false-positive rates, while dynamic approaches such as fuzzing require substantial infrastructure and often target narrow classes of bugs. Recent advances in large language models (LLMs) enable semantic reasoning about program behavior, but applying LLMs to repository-scale security analysis introduces challenges related to context management, cost, and verification. We present OpenAnt, an open-source vulnerability discovery system that integrates static program analysis with LLM-based reasoning in a multi-stage pipeline. OpenAnt introduces three key techniques. First, codebases are decomposed into self-contained analysis units filtered by reachability from external entry points, reducing the analysis surface by up to 97% while preserving attack-relevant code. Second, candidate vulnerabilities undergo adversarial verification through constrained attacker simulation, where the model evaluates exploitability under realistic attacker capabilities. Third, findings are validated through dynamic verification, in which exploit environments are generated automatically, executed in sandboxed containers, and discarded after use. Evaluation on widely used open-source projects including OpenSSL, WordPress, and Flowise shows that this architecture can identify previously unknown vulnerabilities while maintaining manageable analysis cost and substantially reducing false positives. Our results suggest that closed-loop vulnerability discovery pipelines, combining semantic reasoning with exploit validation, provide a practical path toward scalable automated security analysis. OpenAnt is released as open source under the Apache 2.0 license at https://github.com/knostic/OpenAnt.
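The closed-loop idea in the abstract, where a candidate survives only if every verification stage confirms it, might look roughly like the sketch below. It is an illustration, not OpenAnt's API: `Candidate`, `run_closed_loop`, and both stage functions are hypothetical, and the stages are trivial stand-ins for what the paper describes as an LLM attacker simulation and a sandboxed exploit run.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class Candidate:
    unit: str
    claim: str
    history: List[str] = field(default_factory=list)


Stage = Callable[[Candidate], bool]


def run_closed_loop(candidates: Iterable[Candidate],
                    stages: List[Tuple[str, Stage]]) -> List[Candidate]:
    """Keep only candidates that pass every stage, recording each verdict."""
    confirmed = []
    for cand in candidates:
        for name, check in stages:
            passed = check(cand)
            cand.history.append(f"{name}:{'pass' if passed else 'drop'}")
            if not passed:
                break
        else:  # no stage rejected the candidate
            confirmed.append(cand)
    return confirmed


def adversarial_check(cand: Candidate) -> bool:
    # Stand-in for attacker simulation: reject claims needing privileged access.
    return "requires_admin" not in cand.claim


EXPLOIT_SUCCEEDED = {"parse_request"}  # stand-in for sandbox execution results


def dynamic_check(cand: Candidate) -> bool:
    # Stand-in for running a generated exploit in a throwaway container.
    return cand.unit in EXPLOIT_SUCCEEDED


candidates = [
    Candidate("parse_request", "overflow via header"),
    Candidate("render", "xss requires_admin"),
    Candidate("decode_header", "out-of-bounds read"),
]
confirmed = run_closed_loop(
    candidates, [("adversarial", adversarial_check), ("dynamic", dynamic_check)]
)
```

Ordering the cheaper reasoning stage before the more expensive execution stage is a design assumption of this sketch. The per-candidate `history` makes it possible to count how many false positives each stage removes.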
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents OpenAnt, an open-source system integrating static analysis and LLM reasoning for vulnerability discovery. It decomposes codebases into self-contained units filtered by reachability from external entry points (claimed to reduce analysis surface by up to 97% while preserving attack-relevant code), applies adversarial verification via constrained attacker simulation, and validates findings through automated dynamic testing in sandboxed containers. Evaluation on OpenSSL, WordPress, and Flowise is claimed to identify previously unknown vulnerabilities at manageable cost with substantially reduced false positives.
Significance. If the empirical claims are substantiated with quantitative data and validation of modeling assumptions, the work could advance practical automated security analysis by combining semantic LLM reasoning with verification stages to address scalability and false-positive issues in large codebases. The open-source release under Apache 2.0 supports reproducibility and extension.
major comments (2)
- [Abstract] Abstract (evaluation paragraph): The central claim that the architecture identifies previously unknown vulnerabilities while reducing false positives is stated without any supporting metrics, counts of vulnerabilities found, false-positive rates, analysis costs, or evaluation methodology. The soundness of the result cannot be assessed.
- [Code decomposition] Code decomposition (first key technique): The assumption that reachability filtering from external entry points preserves all attack-relevant code is presented without a soundness argument or empirical validation (e.g., checking whether any known CVEs in the evaluated projects fall outside the retained units). This is load-bearing for the 97% reduction claim, as under-approximation could exclude internally triggered or chained attack surfaces.
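The empirical check suggested in the second comment could be as simple as the sketch below. It assumes a mapping from each known vulnerability to the code units its fix touched and a set of units retained after filtering. The function name, the IDs, and the unit names are all hypothetical, and treating a vulnerability as covered only when all of its units are retained is a choice made here, not one stated in the paper.

```python
def cve_coverage(cve_to_units, retained_units):
    """Return (fraction of CVEs fully inside retained units, excluded CVEs with missing units)."""
    retained = set(retained_units)
    excluded = {
        cve: sorted(set(units) - retained)
        for cve, units in cve_to_units.items()
        if not set(units) <= retained
    }
    covered = len(cve_to_units) - len(excluded)
    fraction = covered / len(cve_to_units) if cve_to_units else 1.0
    return fraction, excluded


# Placeholder data for illustration only.
cve_to_units = {
    "KNOWN-1": ["parse_request"],
    "KNOWN-2": ["decode_header", "render"],
    "KNOWN-3": ["delete_tmp"],  # lives in code the filter dropped
}
retained = {"http_handler", "parse_request", "render", "decode_header"}

fraction, excluded = cve_coverage(cve_to_units, retained)
```

Here two of three vulnerabilities stay inside the retained units and `excluded` names the one that does not. A non-empty `excluded` on real data would be direct evidence that the filter under-approximates the attack surface.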
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight opportunities to strengthen the presentation of our empirical claims and the justification for our core techniques. We address each major comment below and commit to revisions that improve clarity and substantiation without altering the underlying contributions.
- Referee: [Abstract] Abstract (evaluation paragraph): The central claim that the architecture identifies previously unknown vulnerabilities while reducing false positives is stated without any supporting metrics, counts of vulnerabilities found, false-positive rates, analysis costs, or evaluation methodology. The soundness of the result cannot be assessed.
  Authors: We agree that the abstract's evaluation paragraph would be strengthened by including concrete quantitative metrics. In the revised manuscript we will update the abstract to reference key results from the evaluation section, including the number of previously unknown vulnerabilities identified across the three projects, measured false-positive rates before and after the verification stages, and per-project analysis costs (in LLM tokens and wall-clock time). This will allow readers to assess the claims directly from the abstract while preserving its high-level nature.
  revision: yes
- Referee: [Code decomposition] Code decomposition (first key technique): The assumption that reachability filtering from external entry points preserves all attack-relevant code is presented without a soundness argument or empirical validation (e.g., checking whether any known CVEs in the evaluated projects fall outside the retained units). This is load-bearing for the 97% reduction claim, as under-approximation could exclude internally triggered or chained attack surfaces.
  Authors: Reachability filtering from external entry points follows standard practice in security-oriented static analysis to focus on externally triggerable code. We will add an explicit soundness discussion and empirical validation in the revised manuscript: we will report the fraction of known CVEs from the evaluated projects (OpenSSL, WordPress) that remain inside the retained units after filtering, and we will discuss the threat model under which internally triggered or chained surfaces are considered out of scope. If the validation reveals any excluded CVEs, we will qualify the 97% reduction claim accordingly.
  revision: yes
Circularity Check
No circularity: empirical system description with no derivations or fitted parameters
The paper presents OpenAnt as an engineering system combining static analysis, LLM reasoning, adversarial verification, and dynamic testing. The abstract and described techniques rely on empirical evaluation on OpenSSL, WordPress, and Flowise rather than any mathematical derivation chain, equations, or parameter fitting. The reachability filtering claim is stated as a design choice that reduces surface while preserving attack-relevant code, without reduction to a self-referential definition or prior self-citation that bears the central result. No load-bearing step reduces by construction to its own inputs; the work is self-contained as an applied pipeline description.