
arxiv: 2606.19149 · v2 · pith:CEEPKL22new · submitted 2026-06-17 · 💻 cs.CR · cs.LG

OpenAnt: LLM-Powered Vulnerability Discovery Through Code Decomposition, Adversarial Verification, and Dynamic Testing

Pith reviewed 2026-06-26 20:19 UTC · model grok-4.3

keywords vulnerability discovery · LLM reasoning · code decomposition · adversarial verification · dynamic testing · static analysis · false positive reduction · security automation

The pith

OpenAnt decomposes codebases and uses LLM reasoning plus sandbox testing to find unknown vulnerabilities with fewer false positives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OpenAnt as a system that breaks large codebases into smaller reachable analysis units, applies large language models to simulate realistic attacker scenarios for candidate issues, and then generates and runs exploit code in temporary sandboxed environments to confirm findings. Traditional static tools generate too many false alarms while dynamic fuzzing needs heavy setup and misses many bug classes, so the goal is a hybrid pipeline that keeps costs reasonable for repository-scale work. Evaluation on projects including OpenSSL, WordPress, and Flowise indicates the method can surface real previously unknown problems while cutting the code under review by up to 97 percent and lowering false positives. If the approach holds, it points toward automated security checks that combine semantic understanding with concrete validation steps rather than relying on either alone.

Core claim

OpenAnt integrates static program analysis with LLM-based reasoning in a multi-stage pipeline. Codebases are decomposed into self-contained analysis units filtered by reachability from external entry points, reducing the analysis surface by up to 97% while preserving attack-relevant code. Candidate vulnerabilities then undergo adversarial verification through constrained attacker simulation. Findings are validated through dynamic verification in which exploit environments are generated automatically, executed in sandboxed containers, and discarded after use. Evaluation on widely used open-source projects shows the architecture identifies previously unknown vulnerabilities while maintaining manageable analysis cost and substantially reducing false positives.
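
The reachability filter described above can be illustrated with a minimal sketch. It assumes the call graph is already available as a plain dictionary; the unit names, the graph, and the function `reachable_units` are hypothetical and are not taken from the OpenAnt code. The sketch keeps only units reachable from external entry points and reports the resulting surface reduction.

```python
from collections import deque


def reachable_units(call_graph, entry_points):
    """Return every unit reachable from any entry point (breadth-first search)."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        unit = queue.popleft()
        for callee in call_graph.get(unit, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


# Hypothetical toy call graph: unit name -> units it calls directly.
call_graph = {
    "http_handler": ["parse_request", "render"],
    "parse_request": ["decode_header"],
    "render": [],
    "decode_header": [],
    "internal_migration": ["rewrite_schema"],
    "rewrite_schema": [],
    "unused_helper": [],
}
entry_points = ["http_handler"]  # assumed external entry point

kept = reachable_units(call_graph, entry_points)
all_units = set(call_graph) | {c for callees in call_graph.values() for c in callees}
reduction = 1 - len(kept) / len(all_units)

print(sorted(kept))
print(f"surface reduction: {reduction:.0%}")
```

On this toy graph four of seven units survive. The paper's 97% figure refers to real repositories and is not reproduced by this example.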

What carries the argument

OpenAnt's three-stage pipeline that decomposes reachable code units, runs adversarial attacker simulation for verification, and performs automated dynamic execution in sandboxes.
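
As a rough illustration of how the three stages compose, the sketch below chains them as plain callables. Everything here is an assumption made for illustration: `Candidate`, `run_pipeline`, and the three stage functions are hypothetical names, and the stubs stand in for what would be an LLM call (candidate proposal and attacker simulation) and a sandboxed container run (exploit execution) in the real system.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    unit: str   # analysis unit the finding belongs to
    issue: str  # short description of the suspected flaw


def run_pipeline(
    units: Iterable[str],
    propose: Callable[[str], List[Candidate]],
    attacker_can_exploit: Callable[[Candidate], bool],
    exploit_succeeds: Callable[[Candidate], bool],
) -> List[Candidate]:
    """Keep only candidates that survive both verification stages."""
    confirmed = []
    for unit in units:
        for candidate in propose(unit):
            # Stage 2: drop candidates a realistic attacker could not reach.
            if not attacker_can_exploit(candidate):
                continue
            # Stage 3: keep only candidates whose exploit actually runs.
            if exploit_succeeds(candidate):
                confirmed.append(candidate)
    return confirmed


# Deterministic stubs so the sketch runs without a model or a container runtime.
def propose(unit):
    return [Candidate(unit, "unsanitized input"), Candidate(unit, "debug-only path")]


def attacker_can_exploit(candidate):
    return candidate.issue != "debug-only path"


def exploit_succeeds(candidate):
    return candidate.unit == "parse_request"


confirmed = run_pipeline(
    ["parse_request", "render"], propose, attacker_can_exploit, exploit_succeeds
)
print(confirmed)
```

Ordering the cheaper reasoning check before the sandbox run is one plausible way to keep cost down; the source does not state how OpenAnt schedules the stages internally.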

If this is right

  • Previously unknown vulnerabilities can be identified in widely used projects such as OpenSSL, WordPress, and Flowise.
  • Analysis surface is reduced by up to 97 percent while attack-relevant code is retained.
  • False positives are substantially lower than those produced by traditional static analysis alone.
  • Analysis cost remains manageable for repository-scale security work.
  • Closed-loop pipelines that combine semantic reasoning with exploit validation offer a route to scalable automated security analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition and validation loop could be applied to detect configuration or logic errors that are not classic memory-safety bugs.
  • Pairing the initial decomposition step with existing static analyzers might further trim the set of units sent to the LLM stage.
  • The sandbox generation process suggests a route for embedding continuous vulnerability checks inside developer build pipelines.

Load-bearing premise

Filtering code into self-contained units by reachability from external entry points keeps every attack-relevant path without dropping exploitable code.
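
One way this premise could fail is sketched below, under the assumption that the call graph is built from direct calls only. The unit names and the indirect edge (a callback dispatched by name at runtime) are hypothetical; the point is that any edge the static analysis misses removes everything behind it from the retained set.

```python
def reachable(edges, roots):
    """Depth-first reachability over a dict of unit -> callees."""
    seen, stack = set(roots), list(roots)
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Edges a purely syntactic analysis would see (hypothetical).
direct_calls = {
    "handle_upload": ["save_file"],
    "save_file": [],
    "on_file_saved": ["run_converter"],
    "run_converter": [],
}
# Edge that exists only at runtime, e.g. a hook looked up by name (hypothetical).
indirect_calls = {"save_file": ["on_file_saved"]}

merged = {unit: list(callees) for unit, callees in direct_calls.items()}
for unit, callees in indirect_calls.items():
    merged.setdefault(unit, []).extend(callees)

static_only = reachable(direct_calls, ["handle_upload"])
with_indirect = reachable(merged, ["handle_upload"])
missed = with_indirect - static_only

print(sorted(missed))
```

Here the converter code is attacker-reachable through the hook but would be filtered out by the direct-call graph alone.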

What would settle it

Run OpenAnt on a codebase containing several known, documented vulnerabilities and measure whether it reports them while keeping the number of candidates that fail sandbox verification low.
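
The two quantities in this test can be written down concretely. The sketch below is an assumption about how one might score such a run, not the paper's evaluation protocol: `settle`, the identifiers, and the sample sets are all hypothetical. Recall is taken over findings that pass the sandbox, and the failure rate is the share of candidates the sandbox rejects.

```python
def settle(known, candidates, confirmed):
    """Return (recall on known vulnerabilities, sandbox failure rate)."""
    known, candidates, confirmed = set(known), set(candidates), set(confirmed)
    if not confirmed <= candidates:
        raise ValueError("confirmed findings must be a subset of the candidates")
    # Share of documented vulnerabilities that end up as confirmed findings.
    recall = len(known & confirmed) / len(known) if known else None
    # Share of candidates sent to the sandbox that failed verification.
    failure_rate = (
        len(candidates - confirmed) / len(candidates) if candidates else None
    )
    return recall, failure_rate


# Hypothetical identifiers; no real vulnerability data is implied.
known = {"vuln-1", "vuln-2", "vuln-3", "vuln-4"}
candidates = {"vuln-1", "vuln-2", "vuln-3", "cand-9", "cand-10"}
confirmed = {"vuln-1", "vuln-2", "vuln-3"}

recall, failure_rate = settle(known, candidates, confirmed)
print(recall, failure_rate)
```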

Original abstract

Automated vulnerability discovery in large codebases remains challenging: traditional static analysis produces high false-positive rates, while dynamic approaches such as fuzzing require substantial infrastructure and often target narrow classes of bugs. Recent advances in large language models (LLMs) enable semantic reasoning about program behavior, but applying LLMs to repository-scale security analysis introduces challenges related to context management, cost, and verification. We present OpenAnt, an open-source vulnerability discovery system that integrates static program analysis with LLM-based reasoning in a multi-stage pipeline. OpenAnt introduces three key techniques. First, codebases are decomposed into self-contained analysis units filtered by reachability from external entry points, reducing the analysis surface by up to 97% while preserving attack-relevant code. Second, candidate vulnerabilities undergo adversarial verification through constrained attacker simulation, where the model evaluates exploitability under realistic attacker capabilities. Third, findings are validated through dynamic verification, in which exploit environments are generated automatically, executed in sandboxed containers, and discarded after use. Evaluation on widely used open-source projects including OpenSSL, WordPress, and Flowise shows that this architecture can identify previously unknown vulnerabilities while maintaining manageable analysis cost and substantially reducing false positives. Our results suggest that closed-loop vulnerability discovery pipelines, combining semantic reasoning with exploit validation, provide a practical path toward scalable automated security analysis. OpenAnt is released as open source under the Apache 2.0 license at https://github.com/knostic/OpenAnt.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents OpenAnt, an open-source system integrating static analysis and LLM reasoning for vulnerability discovery. It decomposes codebases into self-contained units filtered by reachability from external entry points (claimed to reduce analysis surface by up to 97% while preserving attack-relevant code), applies adversarial verification via constrained attacker simulation, and validates findings through automated dynamic testing in sandboxed containers. Evaluation on OpenSSL, WordPress, and Flowise is claimed to identify previously unknown vulnerabilities at manageable cost with substantially reduced false positives.

Significance. If the empirical claims are substantiated with quantitative data and validation of modeling assumptions, the work could advance practical automated security analysis by combining semantic LLM reasoning with verification stages to address scalability and false-positive issues in large codebases. The open-source release under Apache 2.0 supports reproducibility and extension.

major comments (2)
  1. [Abstract] Abstract (evaluation paragraph): The central claim that the architecture identifies previously unknown vulnerabilities while reducing false positives is stated without any supporting metrics, counts of vulnerabilities found, false-positive rates, analysis costs, or evaluation methodology. The soundness of the result cannot be assessed.
  2. [Code decomposition] Code decomposition (first key technique): The assumption that reachability filtering from external entry points preserves all attack-relevant code is presented without a soundness argument or empirical validation (e.g., checking whether any known CVEs in the evaluated projects fall outside the retained units). This is load-bearing for the 97% reduction claim, as under-approximation could exclude internally triggered or chained attack surfaces.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the presentation of our empirical claims and the justification for our core techniques. We address each major comment below and commit to revisions that improve clarity and substantiation without altering the underlying contributions.

Point-by-point responses
  1. Referee: [Abstract] Abstract (evaluation paragraph): The central claim that the architecture identifies previously unknown vulnerabilities while reducing false positives is stated without any supporting metrics, counts of vulnerabilities found, false-positive rates, analysis costs, or evaluation methodology. The soundness of the result cannot be assessed.

    Authors: We agree that the abstract's evaluation paragraph would be strengthened by including concrete quantitative metrics. In the revised manuscript we will update the abstract to reference key results from the evaluation section, including the number of previously unknown vulnerabilities identified across the three projects, measured false-positive rates before and after the verification stages, and per-project analysis costs (in LLM tokens and wall-clock time). This will allow readers to assess the claims directly from the abstract while preserving its high-level nature.

    Revision: yes

  2. Referee: [Code decomposition] Code decomposition (first key technique): The assumption that reachability filtering from external entry points preserves all attack-relevant code is presented without a soundness argument or empirical validation (e.g., checking whether any known CVEs in the evaluated projects fall outside the retained units). This is load-bearing for the 97% reduction claim, as under-approximation could exclude internally triggered or chained attack surfaces.

    Authors: Reachability filtering from external entry points follows standard practice in security-oriented static analysis to focus on externally triggerable code. We will add an explicit soundness discussion and empirical validation in the revised manuscript: we will report the fraction of known CVEs from the evaluated projects (OpenSSL, WordPress) that remain inside the retained units after filtering, and we will discuss the threat model under which internally triggered or chained surfaces are considered out of scope. If the validation reveals any excluded CVEs, we will qualify the 97% reduction claim accordingly.

    Revision: yes
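
The retention check promised in the second response could look roughly like the sketch below. It assumes each known issue has been mapped to the units its fix touched, and it counts an issue as retained only if all of those units survive filtering; both the mapping and that criterion are assumptions, and the identifiers are placeholders rather than real CVE numbers.

```python
def cve_retention(issue_units, retained_units):
    """Return (fraction of issues fully inside retained units, excluded issues)."""
    retained = set(retained_units)
    excluded = sorted(
        issue for issue, units in issue_units.items() if not set(units) <= retained
    )
    if not issue_units:
        return None, excluded
    fraction = (len(issue_units) - len(excluded)) / len(issue_units)
    return fraction, excluded


# Placeholder data: issue label -> units touched by its fix (hypothetical).
issue_units = {
    "known-issue-A": ["decode_header"],
    "known-issue-B": ["parse_request", "render"],
    "known-issue-C": ["rewrite_schema"],
}
retained_units = {"http_handler", "parse_request", "render", "decode_header"}

fraction, excluded = cve_retention(issue_units, retained_units)
print(f"{fraction:.2f}", excluded)
```

Any issue in `excluded` would be a concrete counterexample to the claim that filtering preserves all attack-relevant code, which is the case the authors say would lead them to qualify the 97% figure.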

Circularity Check

0 steps flagged

No circularity: empirical system description with no derivations or fitted parameters

Full rationale

The paper presents OpenAnt as an engineering system combining static analysis, LLM reasoning, adversarial verification, and dynamic testing. The abstract and described techniques rely on empirical evaluation on OpenSSL, WordPress, and Flowise rather than any mathematical derivation chain, equations, or parameter fitting. The reachability filtering claim is stated as a design choice that reduces surface while preserving attack-relevant code, without reduction to a self-referential definition or prior self-citation that bears the central result. No load-bearing step reduces by construction to its own inputs; the work is self-contained as an applied pipeline description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.1-grok · 5796 in / 1098 out tokens · 23621 ms · 2026-06-26T20:19:35.178142+00:00 · methodology

