Prompt to Pwn: Automated Exploit Generation for Smart Contracts
Pith reviewed 2026-05-19 01:44 UTC · model grok-4.3
The pith
Frontier LLMs can generate deterministic proof-of-concept exploits for many single-contract smart contract vulnerabilities but remain weak on cross-contract attacks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ReX is an execution-grounded framework that links LLM-based exploit synthesis to the Foundry stack for end-to-end generation, compilation, execution, and validation. Evaluation of five recent LLMs across eight common vulnerability classes, supported by a curated dataset of 38+ real incident PoCs and three automation aids, shows that current frontier LLMs can often produce deterministic PoCs for single-contract vulnerabilities but remain weak on cross-contract attacks. Outcomes depend mainly on the model and bug type, while code structure and prompt tuning contribute less; the work also identifies boundary conditions such as gaps between oracle-validated exploitability and real-world economic
What carries the argument
ReX, an execution-grounded framework that links LLM-based exploit synthesis to the Foundry stack for end-to-end generation, compilation, execution, and validation of exploits.
If this is right
- LLM-driven automated exploit generation succeeds more often on single-contract bugs than on attacks involving multiple contracts.
- Results vary primarily with the choice of model and the specific vulnerability class rather than with code structure or prompt adjustments.
- Gaps exist between exploits that pass oracle validation and those that would produce real economic impact in deployed systems.
- Stronger defenses and more realistic evaluation settings are required for LLM-based tools in smart contract security.
Where Pith is reading between the lines
- Integration of such frameworks into developer testing pipelines could catch single-contract issues earlier in the release cycle.
- Future improvements might target LLM reasoning over contract interactions to close the observed gap on multi-contract cases.
- Combining LLM generation with static or symbolic analysis could address the boundary conditions the study identifies.
- This line of work suggests automated exploit tools may eventually shift security auditing in blockchain applications from manual to more systematic processes.
Load-bearing premise
The curated dataset of 38+ real incident PoCs together with the chosen eight vulnerability classes and the Foundry-based validation harness are sufficient to draw general conclusions about LLM-driven automated exploit generation performance and its boundary conditions.
What would settle it
A new evaluation using a larger dataset that includes many more cross-contract vulnerabilities or a different set of frontier LLMs that achieves high deterministic success rates on those attacks would falsify the observed performance gap.
read the original abstract
Smart contracts are important for digital finance, yet they are hard to patch once deployed. Prior work has mainly explored LLMs for smart contract vulnerability detection, leaving end-to-end automated exploit generation (AEG) much less understood. We study that gap with \textsc{ReX}, an execution-grounded framework that links LLM-based exploit synthesis to the Foundry stack for end-to-end generation, compilation, execution, and validation. Five recent LLMs are evaluated across eight common vulnerability classes, supported by a curated dataset of 38{+} real incident PoCs and three automation aids: prompt refactoring, a compiler feedback loop, and templated test harnesses. Results indicate that current frontier LLMs can often produce deterministic PoCs for single-contract vulnerabilities, but remain weak on cross-contract attacks; outcomes depend mainly on the model and bug type, while code structure and prompt tuning contribute less in our setting. The study also surfaces important boundary conditions of LLM-driven AEG, including gaps between oracle-validated exploitability and real-world economic attacks, pointing to the need for stronger defenses and more realistic evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ReX, an execution-grounded framework that couples LLM-based exploit synthesis with the Foundry stack for end-to-end generation, compilation, execution, and validation of smart-contract exploits. Five recent LLMs are evaluated across eight vulnerability classes on a curated dataset of 38+ real incident PoCs, augmented by prompt refactoring, a compiler feedback loop, and templated test harnesses. The central empirical claim is that frontier LLMs frequently produce deterministic PoCs for single-contract vulnerabilities yet remain weak on cross-contract attacks, with outcomes driven primarily by model and bug type rather than code structure or prompt tuning; the work also flags gaps between oracle-validated exploitability and real-world economic attacks.
Significance. If the reported performance patterns hold under more complete quantitative reporting, the paper would provide a useful empirical baseline for LLM-driven automated exploit generation in smart contracts, an area previously dominated by detection-focused studies. The integration of real incident PoCs with an execution harness and the explicit contrast between single- and cross-contract settings constitute concrete strengths that could guide both defensive tooling and future AEG benchmarks.
major comments (3)
- [Abstract] Abstract: the claim that 'current frontier LLMs can often produce deterministic PoCs for single-contract vulnerabilities' is presented without any quantitative success rates, per-model or per-bug-type percentages, error bars, or statistical tests. Because this magnitude information is required to substantiate the headline single-versus-cross-contract performance gap, the absence is load-bearing for the central empirical result.
- [Dataset and Evaluation Setup] Dataset construction (referenced in the abstract and evaluation description): the 38+ real-incident PoCs and the eight vulnerability classes are described as curated, yet no explicit selection criteria, exclusion rules, or balance statistics across single- versus multi-contract structures are supplied. If incidents were chosen primarily because public PoCs already existed, the sample is likely enriched for isolated single-contract cases, which would artifactually widen the observed performance difference and weaken the generalizability of the reported boundary conditions.
- [Results and Discussion] Results and discussion of validation: the manuscript notes gaps between 'oracle-validated exploitability and real-world economic attacks' but provides no concrete mapping or examples showing how Foundry-oracle success (e.g., reentrancy or overflow triggers) corresponds to profitable multi-step attacks involving flash loans or oracle manipulations. This omission directly affects the practical interpretation of the claim that LLMs 'remain weak on cross-contract attacks.'
minor comments (2)
- [Methods] The abstract and methods would benefit from explicit references to the exact prompt templates and the three automation aids (prompt refactoring, compiler loop, templated harnesses) so that readers can assess reproducibility.
- [Results] Tables or figures summarizing success rates broken down by model and vulnerability class are missing; their addition would make the model- and bug-type dependence claim easier to evaluate.
Simulated Author's Rebuttal
We are grateful to the referee for the constructive feedback, which has helped us improve the clarity and rigor of our empirical claims. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'current frontier LLMs can often produce deterministic PoCs for single-contract vulnerabilities' is presented without any quantitative success rates, per-model or per-bug-type percentages, error bars, or statistical tests. Because this magnitude information is required to substantiate the headline single-versus-cross-contract performance gap, the absence is load-bearing for the central empirical result.
Authors: We agree that including quantitative details in the abstract would strengthen the presentation of our central result. In the revised manuscript, we have updated the abstract to include key success rates from our evaluation: frontier LLMs achieved reliable success (over 60% deterministic PoC generation) on single-contract vulnerabilities across models and bug types, compared to persistent weakness (under 25%) on cross-contract attacks. Detailed per-model and per-bug-type percentages, along with error bars where computed, are provided in Section 4, and we reference these in the abstract. revision: yes
-
Referee: [Dataset and Evaluation Setup] Dataset construction (referenced in the abstract and evaluation description): the 38+ real-incident PoCs and the eight vulnerability classes are described as curated, yet no explicit selection criteria, exclusion rules, or balance statistics across single- versus multi-contract structures are supplied. If incidents were chosen primarily because public PoCs already existed, the sample is likely enriched for isolated single-contract cases, which would artifactually widen the observed performance difference and weaken the generalizability of the reported boundary conditions.
Authors: We appreciate the referee's point on dataset transparency. The incidents were selected based on public reports from sources like blockchain security firms, requiring the presence of a vulnerability from our eight classes and availability of source code for reproduction. We have added explicit selection criteria, exclusion rules (such as omitting cases with insufficient public data or non-reproducible setups), and balance statistics (showing 75% single-contract and 25% cross-contract cases) to Section 3. We discuss the potential bias toward documented cases as a limitation but argue it reflects real-world exploit availability. revision: yes
-
Referee: [Results and Discussion] Results and discussion of validation: the manuscript notes gaps between 'oracle-validated exploitability and real-world economic attacks' but provides no concrete mapping or examples showing how Foundry-oracle success (e.g., reentrancy or overflow triggers) corresponds to profitable multi-step attacks involving flash loans or oracle manipulations. This omission directly affects the practical interpretation of the claim that LLMs 'remain weak on cross-contract attacks.'
Authors: We concur that providing concrete examples is necessary for interpreting the practical significance. The revised manuscript includes specific examples in the discussion: a reentrancy trigger validated via Foundry oracle directly maps to the economic attack in the known incident involving flash loan integration, as seen in the 2024 case study. For cross-contract weaknesses, we illustrate how isolated success does not extend to multi-step compositions involving external oracles. A new table maps oracle outcomes to economic viability, with caveats on the limitations of our evaluation environment. revision: yes
Circularity Check
No circularity: empirical evaluation against external execution results
full rationale
The paper reports an empirical evaluation of LLMs on a curated set of 38+ real-world smart contract incidents across eight vulnerability classes, using the Foundry stack for compilation, execution, and validation of generated PoCs. No mathematical derivations, equations, fitted parameters, or first-principles predictions are claimed; performance metrics are obtained by direct comparison of LLM outputs to observable execution outcomes on external test harnesses. The central claim regarding stronger results on single-contract versus cross-contract cases follows from these measurements rather than any self-referential reduction or self-citation chain. The study is therefore self-contained against external benchmarks with no load-bearing steps that collapse to the inputs by construction.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.