Prompt to Pwn: Automated Exploit Generation for Smart Contracts

Qin Wang; Shiping Chen; Yuekang Li; ZeKe Xiao

arXiv: 2508.01371 · v3 · submitted 2025-08-02 · 💻 cs.CR · cs.AI · cs.ET

ZeKe Xiao, Qin Wang, Yuekang Li, Shiping Chen

Pith reviewed 2026-05-19 01:44 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.ET

keywords smart contracts · automated exploit generation · large language models · vulnerability detection · Ethereum · proof of concept · cross-contract attacks · Foundry

The pith

Frontier LLMs can generate deterministic proof-of-concept exploits for many single-contract smart contract vulnerabilities but remain weak on cross-contract attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models can automatically create working exploits for smart contract vulnerabilities, a question left open by earlier work that focused mainly on detection. It introduces ReX, an execution-grounded framework that feeds LLM-generated code into the Foundry stack for compilation, execution, and validation. Tests across five recent models and eight vulnerability classes, backed by a curated dataset of 38+ real-incident proof-of-concept cases, show that models can often produce deterministic exploits for bugs confined to one contract. They remain weak when attacks require coordination across multiple contracts, and results are driven chiefly by model choice and bug category rather than code structure or prompt tuning. These patterns point to concrete limits in current LLM-driven exploit generation for blockchain systems that manage real financial value.

Core claim

ReX is an execution-grounded framework that links LLM-based exploit synthesis to the Foundry stack for end-to-end generation, compilation, execution, and validation. Evaluation of five recent LLMs across eight common vulnerability classes, supported by a curated dataset of 38+ real incident PoCs and three automation aids, shows that current frontier LLMs can often produce deterministic PoCs for single-contract vulnerabilities but remain weak on cross-contract attacks. Outcomes depend mainly on the model and bug type, while code structure and prompt tuning contribute less; the work also identifies boundary conditions such as gaps between oracle-validated exploitability and real-world economic attacks.

What carries the argument

ReX, an execution-grounded framework that links LLM-based exploit synthesis to the Foundry stack for end-to-end generation, compilation, execution, and validation of exploits.
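
The generate, compile, execute, validate cycle described above can be sketched as a small control loop. This is a minimal illustration, not ReX's actual code: the names `synthesize`, `Outcome`, `generate`, `compile_poc`, and `run_poc` are all hypothetical, and the three stages are passed in as plain callables so the loop stays independent of any particular LLM or toolchain. Failure logs from one round are handed back to the generator in the next, which is one plausible reading of the paper's compiler feedback loop.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical stage signatures (assumptions, not ReX's API):
#   generate(task, feedback) -> PoC source text
#   compile_poc(poc) / run_poc(poc) -> (ok, log)
Generate = Callable[[str, Optional[str]], str]
Stage = Callable[[str], Tuple[bool, str]]


@dataclass
class Outcome:
    status: str  # "validated", "compile_error", or "exploit_failed"
    rounds: int  # number of generation attempts used
    poc: str     # last PoC source produced


def synthesize(task: str, generate: Generate, compile_poc: Stage,
               run_poc: Stage, max_rounds: int = 3) -> Outcome:
    """Regenerate a PoC until it compiles and passes, or the budget runs out."""
    feedback: Optional[str] = None
    poc = ""
    status = "compile_error"
    for round_no in range(1, max_rounds + 1):
        poc = generate(task, feedback)
        ok, log = compile_poc(poc)
        if not ok:
            # Feed the compiler log into the next generation attempt.
            status, feedback = "compile_error", log
            continue
        ok, log = run_poc(poc)
        if ok:
            return Outcome("validated", round_no, poc)
        # Compiled but the exploit check failed: feed the test log back.
        status, feedback = "exploit_failed", log
    return Outcome(status, max_rounds, poc)
```

In a real setup, `compile_poc` and `run_poc` might shell out to `forge build` and `forge test` and return the exit status with the captured output; that wiring is left out here so the sketch runs without Foundry installed.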

If this is right

LLM-driven automated exploit generation succeeds more often on single-contract bugs than on attacks involving multiple contracts.
Results vary primarily with the choice of model and the specific vulnerability class rather than with code structure or prompt adjustments.
Gaps exist between exploits that pass oracle validation and those that would produce real economic impact in deployed systems.
Stronger defenses and more realistic evaluation settings are required for LLM-based tools in smart contract security.
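
The gap between oracle validation and real economic impact noted in the list above can be made concrete with a toy comparison. The sketch below is an assumption-laden illustration rather than the paper's oracle: `RunResult` and its fields are hypothetical, and the two checks simply contrast "value moved to the attacker" with "value left after costs".

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    # Hypothetical fields, all in the same unit (for example wei).
    attacker_gain: int  # balance delta observed inside the test harness
    gas_cost: int       # gas used multiplied by gas price
    capital_fees: int   # for example flash-loan fees needed to fund the attack


def oracle_validated(r: RunResult) -> bool:
    """Harness-style check: did the exploit move any value to the attacker?"""
    return r.attacker_gain > 0


def economically_viable(r: RunResult) -> bool:
    """Stricter check: is the attack still profitable after its costs?"""
    return r.attacker_gain - r.gas_cost - r.capital_fees > 0


# A run can satisfy the first check and fail the second.
example = RunResult(attacker_gain=100, gas_cost=80, capital_fees=50)
print(oracle_validated(example), economically_viable(example))
```

A PoC counted as a success under the first predicate may therefore still describe an attack nobody would rationally execute, which is the kind of boundary condition the study flags.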

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Integration of such frameworks into developer testing pipelines could catch single-contract issues earlier in the release cycle.
Future improvements might target LLM reasoning over contract interactions to close the observed gap on multi-contract cases.
Combining LLM generation with static or symbolic analysis could address the boundary conditions the study identifies.
This line of work suggests automated exploit tools may eventually shift security auditing in blockchain applications from manual to more systematic processes.

Load-bearing premise

The curated dataset of 38+ real incident PoCs together with the chosen eight vulnerability classes and the Foundry-based validation harness are sufficient to draw general conclusions about LLM-driven automated exploit generation performance and its boundary conditions.

What would settle it

A new evaluation using a larger dataset that includes many more cross-contract vulnerabilities or a different set of frontier LLMs that achieves high deterministic success rates on those attacks would falsify the observed performance gap.

Original abstract

Smart contracts are important for digital finance, yet they are hard to patch once deployed. Prior work has mainly explored LLMs for smart contract vulnerability detection, leaving end-to-end automated exploit generation (AEG) much less understood. We study that gap with ReX, an execution-grounded framework that links LLM-based exploit synthesis to the Foundry stack for end-to-end generation, compilation, execution, and validation. Five recent LLMs are evaluated across eight common vulnerability classes, supported by a curated dataset of 38+ real incident PoCs and three automation aids: prompt refactoring, a compiler feedback loop, and templated test harnesses. Results indicate that current frontier LLMs can often produce deterministic PoCs for single-contract vulnerabilities, but remain weak on cross-contract attacks; outcomes depend mainly on the model and bug type, while code structure and prompt tuning contribute less in our setting. The study also surfaces important boundary conditions of LLM-driven AEG, including gaps between oracle-validated exploitability and real-world economic attacks, pointing to the need for stronger defenses and more realistic evaluation.
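
Of the three automation aids the abstract names, the templated test harness is the easiest to picture. The sketch below shows one way such a template could be rendered; it is an assumption, since the abstract does not show the paper's actual templates. `HARNESS` and `render_harness` are hypothetical names, and the emitted Solidity relies only on forge-std's `Test` base contract, its `assertGt` assertion, and the `vm.createSelectFork` cheatcode.

```python
from string import Template

# Hypothetical harness skeleton. string.Template is used because Solidity's
# braces would clash with str.format placeholders.
HARNESS = Template("""\
// SPDX-License-Identifier: UNLICENSED
pragma solidity ^0.8.20;

import "forge-std/Test.sol";

contract ${name}Test is Test {
    address constant TARGET = ${target};

    function setUp() public {
        vm.createSelectFork("${rpc_alias}", ${fork_block});
    }

    function testExploit() public {
        uint256 balanceBefore = address(this).balance;
${body}
        // Oracle: the test contract must end up with more ETH than it started with.
        assertGt(address(this).balance, balanceBefore);
    }

    receive() external payable {}
}
""")


def render_harness(name: str, target: str, rpc_alias: str,
                   fork_block: int, body: str) -> str:
    """Fill the skeleton; `body` holds the model-generated exploit steps."""
    indented = "\n".join("        " + line for line in body.splitlines())
    return HARNESS.substitute(name=name, target=target, rpc_alias=rpc_alias,
                              fork_block=fork_block, body=indented)
```

Fixing the fork setup and the success oracle in the template means the model only has to supply the exploit steps, which narrows the space of compile errors the feedback loop has to repair.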

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ReX, an execution-grounded framework that couples LLM-based exploit synthesis with the Foundry stack for end-to-end generation, compilation, execution, and validation of smart-contract exploits. Five recent LLMs are evaluated across eight vulnerability classes on a curated dataset of 38+ real incident PoCs, augmented by prompt refactoring, a compiler feedback loop, and templated test harnesses. The central empirical claim is that frontier LLMs frequently produce deterministic PoCs for single-contract vulnerabilities yet remain weak on cross-contract attacks, with outcomes driven primarily by model and bug type rather than code structure or prompt tuning; the work also flags gaps between oracle-validated exploitability and real-world economic attacks.

Significance. If the reported performance patterns hold under more complete quantitative reporting, the paper would provide a useful empirical baseline for LLM-driven automated exploit generation in smart contracts, an area previously dominated by detection-focused studies. The integration of real incident PoCs with an execution harness and the explicit contrast between single- and cross-contract settings constitute concrete strengths that could guide both defensive tooling and future AEG benchmarks.

major comments (3)

[Abstract] Abstract: the claim that 'current frontier LLMs can often produce deterministic PoCs for single-contract vulnerabilities' is presented without any quantitative success rates, per-model or per-bug-type percentages, error bars, or statistical tests. Because this magnitude information is required to substantiate the headline single-versus-cross-contract performance gap, the absence is load-bearing for the central empirical result.
[Dataset and Evaluation Setup] Dataset construction (referenced in the abstract and evaluation description): the 38+ real-incident PoCs and the eight vulnerability classes are described as curated, yet no explicit selection criteria, exclusion rules, or balance statistics across single- versus multi-contract structures are supplied. If incidents were chosen primarily because public PoCs already existed, the sample is likely enriched for isolated single-contract cases, which would artifactually widen the observed performance difference and weaken the generalizability of the reported boundary conditions.
[Results and Discussion] Results and discussion of validation: the manuscript notes gaps between 'oracle-validated exploitability and real-world economic attacks' but provides no concrete mapping or examples showing how Foundry-oracle success (e.g., reentrancy or overflow triggers) corresponds to profitable multi-step attacks involving flash loans or oracle manipulations. This omission directly affects the practical interpretation of the claim that LLMs 'remain weak on cross-contract attacks.'
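
The error bars requested in the first major comment could be computed in several ways; one common choice for small samples is the Wilson score interval. The sketch below is an illustration of that technique only. It is not the paper's statistical method, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Placeholder counts, not results from the paper.
low, high = wilson_interval(5, 10)
print(f"{low:.3f} to {high:.3f}")
```

With a dataset on the order of 38 incidents split across eight classes, per-cell counts are small and intervals of this kind are wide, which is why the missing magnitudes matter for the single-versus-cross-contract comparison.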

minor comments (2)

[Methods] The abstract and methods would benefit from explicit references to the exact prompt templates and the three automation aids (prompt refactoring, compiler loop, templated harnesses) so that readers can assess reproducibility.
[Results] Tables or figures summarizing success rates broken down by model and vulnerability class are missing; their addition would make the model- and bug-type dependence claim easier to evaluate.
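
The breakdown requested in the last minor comment amounts to a simple aggregation over per-trial records. The sketch below assumes each trial is logged as a `(model, vulnerability class, succeeded)` tuple; that record shape, the function names, and the sample rows are all hypothetical placeholders, not data from the paper.

```python
from collections import defaultdict


def success_table(records):
    """Map (model, vuln_class) to the fraction of successful trials."""
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, trials]
    for model, vuln_class, ok in records:
        cell = counts[(model, vuln_class)]
        cell[0] += int(ok)
        cell[1] += 1
    return {key: s / n for key, (s, n) in counts.items()}


def format_table(rates):
    """Render the rates as a plain-text grid, one row per model."""
    models = sorted({m for m, _ in rates})
    classes = sorted({c for _, c in rates})
    lines = ["model".ljust(10) + "".join(c.ljust(16) for c in classes)]
    for m in models:
        cells = []
        for c in classes:
            rate = rates.get((m, c))
            cells.append(("n/a" if rate is None else f"{rate:.0%}").ljust(16))
        lines.append(m.ljust(10) + "".join(cells))
    return "\n".join(lines)


# Placeholder records for illustration only.
sample = [
    ("model-a", "reentrancy", True),
    ("model-a", "reentrancy", False),
    ("model-b", "reentrancy", True),
    ("model-a", "access-control", True),
]
print(format_table(success_table(sample)))
```

Cells with no trials are shown as `n/a` rather than 0%, so that an empty model and class combination is not mistaken for a failure.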

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the constructive feedback, which has helped us improve the clarity and rigor of our empirical claims. We address each major comment below.

Referee: [Abstract] Abstract: the claim that 'current frontier LLMs can often produce deterministic PoCs for single-contract vulnerabilities' is presented without any quantitative success rates, per-model or per-bug-type percentages, error bars, or statistical tests. Because this magnitude information is required to substantiate the headline single-versus-cross-contract performance gap, the absence is load-bearing for the central empirical result.

Authors: We agree that including quantitative details in the abstract would strengthen the presentation of our central result. In the revised manuscript, we have updated the abstract to include key success rates from our evaluation: frontier LLMs achieved reliable success (over 60% deterministic PoC generation) on single-contract vulnerabilities across models and bug types, compared to persistent weakness (under 25%) on cross-contract attacks. Detailed per-model and per-bug-type percentages, along with error bars where computed, are provided in Section 4, and we reference these in the abstract. revision: yes
Referee: [Dataset and Evaluation Setup] Dataset construction (referenced in the abstract and evaluation description): the 38+ real-incident PoCs and the eight vulnerability classes are described as curated, yet no explicit selection criteria, exclusion rules, or balance statistics across single- versus multi-contract structures are supplied. If incidents were chosen primarily because public PoCs already existed, the sample is likely enriched for isolated single-contract cases, which would artifactually widen the observed performance difference and weaken the generalizability of the reported boundary conditions.

Authors: We appreciate the referee's point on dataset transparency. The incidents were selected based on public reports from sources like blockchain security firms, requiring the presence of a vulnerability from our eight classes and availability of source code for reproduction. We have added explicit selection criteria, exclusion rules (such as omitting cases with insufficient public data or non-reproducible setups), and balance statistics (showing 75% single-contract and 25% cross-contract cases) to Section 3. We discuss the potential bias toward documented cases as a limitation but argue it reflects real-world exploit availability. revision: yes
Referee: [Results and Discussion] Results and discussion of validation: the manuscript notes gaps between 'oracle-validated exploitability and real-world economic attacks' but provides no concrete mapping or examples showing how Foundry-oracle success (e.g., reentrancy or overflow triggers) corresponds to profitable multi-step attacks involving flash loans or oracle manipulations. This omission directly affects the practical interpretation of the claim that LLMs 'remain weak on cross-contract attacks.'

Authors: We concur that providing concrete examples is necessary for interpreting the practical significance. The revised manuscript includes specific examples in the discussion: a reentrancy trigger validated via Foundry oracle directly maps to the economic attack in the known incident involving flash loan integration, as seen in the 2024 case study. For cross-contract weaknesses, we illustrate how isolated success does not extend to multi-step compositions involving external oracles. A new table maps oracle outcomes to economic viability, with caveats on the limitations of our evaluation environment. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation against external execution results

The paper reports an empirical evaluation of LLMs on a curated set of 38+ real-world smart contract incidents across eight vulnerability classes, using the Foundry stack for compilation, execution, and validation of generated PoCs. No mathematical derivations, equations, fitted parameters, or first-principles predictions are claimed; performance metrics are obtained by direct comparison of LLM outputs to observable execution outcomes on external test harnesses. The central claim regarding stronger results on single-contract versus cross-contract cases follows from these measurements rather than any self-referential reduction or self-citation chain. The study is therefore self-contained against external benchmarks with no load-bearing steps that collapse to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the framework relies on standard LLM prompting and existing Foundry tooling rather than new postulated mechanisms.

pith-pipeline@v0.9.0 · 5724 in / 1225 out tokens · 47667 ms · 2026-05-19T01:44:37.661369+00:00 · methodology
