pith. machine review for the scientific record.

arxiv: 2605.00034 · v1 · submitted 2026-04-28 · 💻 cs.CR · cs.PL · cs.SE

Recognition: unknown

Symbolic Execution Meets Multi-LLM Orchestration: Detecting Memory Vulnerabilities in Incomplete Rust CVE Snippets

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:01 UTC · model grok-4.3

classification 💻 cs.CR · cs.PL · cs.SE
keywords Rust security · memory vulnerabilities · symbolic execution · multi-LLM orchestration · CVE analysis · incomplete code · vulnerability detection · graph database

The pith

A four-agent LLM system synthesizes symbolic execution harnesses from incomplete Rust CVE snippets on which existing tools fail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that dividing LLM labor across four specialized roles allows construction of runnable test wrappers from isolated code fragments taken from security reports. This matters because CVE entries routinely omit the surrounding definitions and build files needed for any formal analysis, leaving most reported issues outside the reach of verification engines. The agents handle planning, vulnerability scanning, interface code generation, and filtering so that execution paths can be explored and then stored in a linked graph for later queries. If the approach holds, previously unanalyzable snippets become sources of concrete memory error data across dozens of real cases.

Core claim

The paper claims that a multi-agent LLM architecture with four distinct roles collaboratively produces compilable symbolic execution harnesses from otherwise unbuildable Rust CVE fragments. It reports 90.3 percent wrapper success on 31 real cases spanning 11 weakness categories, with 1,206 critical errors surfaced in 26 files, while a single-agent version yields only 487 errors and conventional linters produce far fewer warnings.

What carries the argument

The four-agent multi-LLM architecture that plans analysis, checks safety, generates interface wrappers, and optimizes execution to turn incomplete code fragments into analyzable harnesses whose results feed a graph database of vulnerabilities.
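The division of labor can be pictured as a staged pipeline that threads a structured context object through the four roles. The sketch below is hypothetical: the real agents are LLM-backed and the paper does not publish their prompts or interfaces, so every function body here is a stand-in stub.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the four-role pipeline. Each stage is a stub that
# records the structured context it would pass downstream; in the paper,
# each role is backed by an LLM call.

@dataclass
class Context:
    snippet: str                                  # incomplete Rust CVE fragment
    plan: str = ""                                # Oracle/Validator output
    findings: list = field(default_factory=list)  # Safety Checker output
    harness: str = ""                             # Code Specialist output
    keep: bool = True                             # Fast Filter decision

def oracle_validator(ctx: Context) -> Context:
    # Strategic planning: decide how the fragment should be analyzed.
    ctx.plan = "symbolic-execution plan for: " + ctx.snippet.splitlines()[0]
    return ctx

def safety_checker(ctx: Context) -> Context:
    # Vulnerability analysis: flag suspicious constructs in the fragment.
    if "unsafe" in ctx.snippet:
        ctx.findings.append("unsafe block: candidate memory vulnerability")
    return ctx

def code_specialist(ctx: Context) -> Context:
    # FFI wrapper generation: produce a KLEE-drivable entry point.
    ctx.harness = 'extern "C" fn klee_entry() { /* calls snippet */ }'
    return ctx

def fast_filter(ctx: Context) -> Context:
    # Execution optimization: skip fragments with nothing to test.
    ctx.keep = bool(ctx.findings)
    return ctx

def run_pipeline(snippet: str) -> Context:
    ctx = Context(snippet=snippet)
    for stage in (oracle_validator, safety_checker, code_specialist, fast_filter):
        ctx = stage(ctx)
    return ctx
```

The key design point carried by the sketch is the structured context passing the abstract credits for the multi-agent gains: each role reads and appends to one shared record rather than exchanging free-form text.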

If this is right

  • Compilation success for incomplete CVE fragments rises above 90 percent where prior tools achieve none.
  • Detected critical errors more than double when role specialization replaces a single general model.
  • A graph database connects individual files, weakness types, error categories, and execution paths to support structured cross-report queries.
  • The same fragments that produce zero output from standard verification now yield measurable error data for security analysis.
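The third point, a graph of typed nodes and labelled edges, can be approximated with a small in-memory store. This is a minimal sketch under assumed node and edge names; the paper's actual `graph_klee.py` schema is not reproduced here.

```python
from collections import defaultdict

# Minimal sketch of a typed-node / labelled-edge vulnerability store.
# Node types ("CVE", "CWE", "ErrorFile") and the edge labels are
# illustrative assumptions, not the paper's published schema.

class VulnGraph:
    def __init__(self):
        self.nodes = {}                   # node id -> node type
        self.edges = defaultdict(set)     # (src, label) -> set of dst ids

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, src, label, dst):
        self.edges[(src, label)].add(dst)

    def neighbors(self, src, label):
        # Follow one labelled edge type out of a node.
        return sorted(self.edges[(src, label)])

g = VulnGraph()
g.add_node("cve-2020-35904", "CVE")
g.add_node("CWE-131", "CWE")
g.add_node("test001079.exter", "ErrorFile")
g.add_edge("cve-2020-35904", "HasWeakness", "CWE-131")
g.add_edge("cve-2020-35904", "Triggers", "test001079.exter")
```

With files, weakness categories, and KLEE error outputs all addressable as nodes, a cross-report query reduces to following labelled edges, which is what makes the structured queries claimed in the bullet above plausible.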

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The orchestration pattern could apply to partial code in languages other than Rust whenever full context is missing from reports.
  • The resulting vulnerability graph might enable automated grouping of CVEs that share similar root causes across different projects.
  • If wrapper quality can be checked against complete code when it later becomes available, the method gains an empirical reliability test.
  • Security teams could shift from waiting for full reproducible projects to extracting signals directly from the snippets that vendors already publish.

Load-bearing premise

The wrappers built by the agents preserve the original behavior of the snippets without adding or hiding memory issues that were not present in the reported code.

What would settle it

Independent review of the generated wrappers against the original CVE descriptions and any available full code to determine whether the reported execution errors match actual vulnerabilities rather than synthesis artifacts.

Figures

Figures reproduced from arXiv: 2605.00034 by Young Lee, Zeyad Abdelrazek.

Figure 1
Figure 1. 4-Agent Pipeline Overview. The pipeline proceeds left to right through six stages. (1) A Rust CVE snippet (incomplete …
Figure 2
Figure 2. Graph DB visualisation produced by running graph_klee.py on klee_output/cwe-131-cve-2020-35904. Teal nodes represent vulnerability functions generated by the Code Specialist (e.g., data_race_incr..., out_of_bound...); yellow nodes represent the concrete KLEE error files they trigger (e.g., test001079.exter..., test000009.ptr.er...). Each directed Triggers edge connects a function to a specific fa…
Figure 3
Figure 3. Critical errors detected by KLEE per CWE category.
Figure 4
Figure 4. Detection rate comparison across evaluated tools.
Original abstract

This paper presents a system combining symbolic execution (KLEE) with a 4-agent multi-LLM architecture for detecting memory vulnerabilities in Rust unsafe code. A central challenge we address is the incomplete-code problem: CVE database entries provide only isolated code snippets that lack struct definitions, imports, and Cargo manifests, causing all existing formal verification tools to fail at compilation with zero output. Our system resolves this through four specialized agents -- an Oracle/Validator for strategic planning, a Safety Checker for vulnerability analysis, a Code Specialist for FFI wrapper generation, and a Fast Filter for execution optimization -- that collaboratively synthesize KLEE-compatible harnesses from otherwise uncompilable fragments. KLEE's output is then ingested by graph_klee.py, which constructs a Graph Database linking CVE files, CWE categories, error types, and symbolic execution paths as typed nodes and labelled edges, enabling structured cross-CVE vulnerability queries. We evaluated our system on 31 real-world Rust CVEs spanning 11 CWE categories, achieving 90.3% wrapper compilation success where all state-of-the-art formal verification tools achieve 0%. Our system detected 1,206 critical errors across 26 files (83.9% detection rate), compared to 14 warnings across 11 files for Clippy (35.5%) and generic labels for Miri. The 4-agent architecture reduced wrapper compilation failures from 42% (single-agent baseline) to 9.7% and increased detected errors from 487 to 1,206, confirming that role specialization and structured context passing produce measurably better results than a single general-purpose model. Our replication package is publicly available at https://github.com/Zeyad-Ab/Symbolic-Execution-with-Multi-LLM-Architecture-for-Rust-Security

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a 4-agent multi-LLM system (Oracle/Validator, Safety Checker, Code Specialist, Fast Filter) that synthesizes KLEE-compatible harnesses from incomplete Rust CVE snippets, enabling symbolic execution where standard tools fail at compilation. It reports 90.3% wrapper success on 31 CVEs, detection of 1,206 critical errors in 26 files (83.9% rate), a graph database for cross-CVE queries, and improvements over a single-agent baseline (9.7% vs 42% failures, 1,206 vs 487 errors) plus comparisons to Clippy and Miri.

Significance. If the synthesized harnesses are semantically faithful and the KLEE errors are true positives, the work addresses a practical barrier in analyzing real-world incomplete security disclosures. The public replication package strengthens reproducibility and allows independent inspection of the agent orchestration and graph construction pipeline.

major comments (2)
  1. [Evaluation (as summarized in the abstract)] The headline quantitative results (1,206 errors, 83.9% detection rate, 90.3% compilation success) are load-bearing for the central claim yet rest on an unverified assumption that LLM-generated harnesses preserve the original snippet semantics. No manual audit of detected errors, no comparison against independently completed versions of the same CVEs, and no false-positive measurement on KLEE output are described, so it is unclear whether flagged memory violations reflect genuine vulnerabilities or artifacts from invented struct definitions, imports, FFI wrappers, or altered control flow.
  2. [Evaluation (as summarized in the abstract)] The single-agent baseline comparison demonstrates that role specialization increases compilable harnesses and reported errors, but does not test whether the additional detections are semantically correct rather than spurious paths introduced by the multi-agent synthesis process.
minor comments (2)
  1. The role of graph_klee.py and the resulting Graph Database (nodes for CVEs, CWEs, error types, paths; labelled edges) would benefit from a concrete schema diagram or example query to clarify how structured cross-CVE analysis is performed.
  2. [Abstract] The abstract states that all state-of-the-art formal verification tools achieve 0% success; a brief enumeration of the specific tools attempted and the exact compilation failure modes would strengthen this claim.
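As an illustration of the first minor comment, the kind of cross-CVE query the requested schema would support can be sketched in a few lines. All CVE identifiers other than cve-2020-35904 and the HasWeakness edge label are hypothetical, chosen only to make the example runnable; they are not drawn from the paper's graph.

```python
# Hypothetical cross-CVE query: find all CVEs sharing a CWE category,
# over a flat list of labelled edges. Identifiers and the edge label
# are illustrative, not the actual graph_klee.py schema.

edges = [
    ("cve-2020-35904", "HasWeakness", "CWE-131"),
    ("cve-2021-00001", "HasWeakness", "CWE-131"),  # hypothetical CVE id
    ("cve-2021-00002", "HasWeakness", "CWE-416"),  # hypothetical CVE id
]

def cves_with_cwe(edges, cwe):
    # Collect every CVE node with a HasWeakness edge into the given CWE.
    return sorted(src for src, label, dst in edges
                  if label == "HasWeakness" and dst == cwe)

print(cves_with_cwe(edges, "CWE-131"))
# → ['cve-2020-35904', 'cve-2021-00001']
```

An explicit schema plus one such worked query in the paper would make the "structured cross-CVE analysis" claim concrete, as the comment requests.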

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledge the points where our current evaluation is limited, and commit to specific revisions that strengthen the claims without misrepresenting the existing results.

Point-by-point responses
  1. Referee: [Evaluation (as summarized in the abstract)] The headline quantitative results (1,206 errors, 83.9% detection rate, 90.3% compilation success) are load-bearing for the central claim yet rest on an unverified assumption that LLM-generated harnesses preserve the original snippet semantics. No manual audit of detected errors, no comparison against independently completed versions of the same CVEs, and no false-positive measurement on KLEE output are described, so it is unclear whether flagged memory violations reflect genuine vulnerabilities or artifacts from invented struct definitions, imports, FFI wrappers, or altered control flow.

    Authors: We agree that semantic fidelity of the generated harnesses is a critical assumption underlying the reported error counts. The manuscript demonstrates that the 4-agent system produces compilable harnesses at 90.3% where all prior tools achieve 0%, and that these harnesses enable KLEE to surface 1,206 errors; however, we did not include a manual audit of harness correctness, a comparison to independently completed CVE versions, or a false-positive analysis of KLEE outputs. In the revised manuscript we will add a dedicated subsection to the Evaluation that reports: (1) manual review of a random sample of 50 harnesses for semantic alignment with the original CVE snippets, (2) KLEE results on any CVEs for which independently completed versions can be obtained, and (3) a qualitative assessment of a subset of reported errors to identify potential artifacts. These additions will be reflected in the abstract and conclusion as well. revision: yes

  2. Referee: [Evaluation (as summarized in the abstract)] The single-agent baseline comparison demonstrates that role specialization increases compilable harnesses and reported errors, but does not test whether the additional detections are semantically correct rather than spurious paths introduced by the multi-agent synthesis process.

    Authors: The single-agent baseline was included to quantify the benefit of role specialization on compilation success (42% failures reduced to 9.7%) and on the number of KLEE-detected errors (487 to 1,206). We acknowledge that this comparison does not verify whether the additional errors are semantically valid rather than artifacts of the multi-agent harness synthesis. The manual audit and fidelity checks described in our response to the first comment will be applied equally to both the multi-agent and single-agent outputs on the sampled CVEs. This will allow us to report whether the increase in detections corresponds to more true positives or to additional spurious paths, and we will add this comparative analysis to the revised Evaluation section. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on external tool comparisons and real CVE data

Full rationale

The paper's claims rest on an empirical evaluation of a 4-agent LLM system synthesizing KLEE harnesses for 31 incomplete Rust CVE snippets, reporting concrete metrics (90.3% compilation success, 1,206 detected errors, 83.9% detection rate) obtained by running the pipeline and comparing outputs against Clippy (14 warnings) and Miri on the same inputs. No equations, fitted parameters, or first-principles derivations are present; the improvement from the single-agent baseline (42% failure rate, 487 errors) to the 4-agent system is measured directly rather than defined into existence. No self-citations, uniqueness theorems, or ansatzes appear in the provided text, and the replication package is offered for external verification. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 4 invented entities

The central claim relies on the effectiveness of newly defined agent roles and the assumption that symbolic execution on synthesized harnesses accurately reflects vulnerabilities in the original code.

axioms (1)
  • domain assumption Large language models can accurately generate KLEE-compatible harnesses and FFI wrappers from incomplete code snippets
    The success of the system depends on the LLM agents' ability to produce correct and compilable code.
invented entities (4)
  • Oracle/Validator agent no independent evidence
    purpose: Strategic planning for harness generation
    Specialized role in the multi-agent architecture.
  • Safety Checker agent no independent evidence
    purpose: Vulnerability analysis
    Specialized role in the multi-agent architecture.
  • Code Specialist agent no independent evidence
    purpose: FFI wrapper generation
    Specialized role in the multi-agent architecture.
  • Fast Filter agent no independent evidence
    purpose: Execution optimization
    Specialized role in the multi-agent architecture.

pith-pipeline@v0.9.0 · 5632 in / 1532 out tokens · 111784 ms · 2026-05-09T20:01:36.629069+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1]

Vytautas Astrauskas, Christoph Matheja, Federico Poli, Peter Müller, and Alexander J Summers. 2019. Leveraging Rust Types for Modular Specification and Verification. In Proceedings of the ACM on Programming Languages, Vol. 3. ACM, 1–30

  2. [2]

James Bornholt. 2020. Haybale: Symbolic Execution of Rust Programs. https://github.com/PLSysSec/haybale

  3. [3]

Cristian Cadar, Daniel Dunbar, and Dawson Engler. 2008. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 209–224

  4. [4]

Jia Chen et al. 2023. Detecting Rust Unrecoverable Panics via Symbolic Execution. In Proceedings of the International Symposium on Software Testing and Analysis. ACM

  5. [5]

Mark Chen et al. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374 (2021)

  6. [6]

Yinghao Chen et al. 2023. ChatUniTest: A Framework for LLM-Based Test Generation. In Proceedings of the 32nd ACM International Symposium on Software Testing and Analysis (ISSTA)

  7. [7]

Vitaly Chipounov, Volodymyr Kuznetsov, and George Candea. 2011. S2E: A Platform for In-Vivo Multi-Path Analysis of Software Systems. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

  8. [8]

Gelei Deng et al. 2024. PentestGPT: A GPT-Empowered Automatic Penetration Testing Tool. In Proceedings of the 33rd USENIX Security Symposium

  9. [9]

Yinlin Deng et al. 2023. TitanFuzz: Black-Box Fuzzing of Deep-Learning Libraries via Large Language Models. Proceedings of the 45th International Conference on Software Engineering (ICSE) (2023)

  10. [10]

Xavier Denis, Jacques-Henri Jourdan, and Claude Marché. 2022. Creusot: A Foundry for the Deductive Verification of Rust Programs. Lecture Notes in Computer Science 13047 (2022), 90–105

  11. [11]

Patrice Godefroid, Nils Klarlund, and Koushik Sen. 2005. DART: Directed Automated Random Testing. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 213–223

  12. [12]

Patrice Godefroid, Michael Y Levin, and David Molnar. 2008. Automated Whitebox Fuzz Testing. In Proceedings of the 15th Network and Distributed System Security Symposium (NDSS)

  13. [13]

Sirui Hong et al. 2023. MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. arXiv preprint arXiv:2308.00352 (2023)

  14. [14]

    Bo Jiang et al. 2024. Crabtree: LLM-Guided Fuzz Driver Generation for Rust. In Proceedings of the 33rd USENIX Security Symposium

  15. [15]

Steve Klabnik and Carol Nichols. 2019. The Rust Programming Language. No Starch Press

  16. [16]

Martin Larsson. 2019. cargo-geiger: Detects unsafe Rust Code. https://github.com/rust-secure-code/cargo-geiger

  17. [17]

    Peng Liu et al. 2020. Targeted Symbolic Execution for Use-After-Free Detection. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering

  18. [18]

    Yu Luo, Han Zhou, Mengtao Zhang, Dylan De La Rosa, Hafsa Ahmed, Weifeng Xu, and Dianxiang Xu. 2025. HALURust: Exploiting Hallucinations of Large Language Models to Detect Vulnerabilities in Rust. arXiv:2503.10793 [cs.CR] https://arxiv.org/abs/2503.10793

  19. [19]

Nicholas D Matsakis and Felix S Klock. 2014. The Rust Programming Language. ACM SIGAda Ada Letters 34, 3 (2014), 103–104

  20. [20]

    Scott Olson, Oliver Günther, et al. 2018. Miri: An Interpreter for Rust’s Mid-level Intermediate Representation. https://github.com/rust-lang/miri

  21. [21]

Chen Qian et al. 2024. ChatDev: Communicative Agents for Software Development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics

  22. [22]

    Koushik Sen, Darko Marinov, and Gul Agha. 2005. CUTE: A Concolic Unit Testing Engine for C. InProceedings of the 13th ACM SIGSOFT Symposium on Foundations of Software Engineering. 263–272

  23. [23]

    Yan Shoshitaishvili, Ruoyu Wang, Christopher Hauser, Christopher Kruegel, and Giovanni Vigna. 2015. Concretely Mapped Memory for Symbolic Execution. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security

  24. [24]

Yuqiang Sun et al. 2024. LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs’ Vulnerability Reasoning. In Proceedings of the 33rd USENIX Security Symposium

  25. [25]

    Michele Tufano et al. 2022. AthenaTest: Unit Test Generation Using LLMs. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering

  26. [26]

Alexa VanHattum, Daniel Schwartz-Narbonne, Nathan Chong, and Adrian Sampson. 2022. Kani: Catching Bugs, Automatically, in Safe Rust. In Proceedings of the International Conference on Software Engineering: Software Engineering in Practice. ACM, 138–149

  27. [27]

Junjie Wang et al. 2023. A Survey on Large Language Models for Software Testing. arXiv preprint arXiv:2307.07221 (2023)

  28. [28]

Zhun Xu et al. 2024. AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks. In Workshop on Large Language Models for Code at ICLR

  29. [29]

He Ye, Matias Martinez, and Martin Monperrus. 2022. VulnFix: Fast Patch Validation via Symbolic Execution. In Proceedings of the 37th IEEE/ACM International Conference on Automated ...

  30. [30]

    Jian Zhang et al. 2024. Multi-Agent Software Engineering: A Survey. InProceed- ings of the ACM/IEEE International Conference on Software Engineering. ACM

  31. [31]

Yaojie Zheng et al. 2023. Large Language Models for Test Input Generation in Symbolic Execution. In Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering