CyberChainBench: Can AI Agents Secure Smart Contracts Against Real-World On-Chain Vulnerabilities?

Fengqing Jiang; Jintao Huang; Radha Poovendran; Zhiqiang Lin

arxiv: 2606.26216 · v1 · pith:2CSWJVHAnew · submitted 2026-06-24 · 💻 cs.CR · cs.AI


Pith reviewed 2026-06-26 01:44 UTC · model grok-4.3


keywords: smart contract security, LLM agents, vulnerability detection, exploit generation, patch synthesis, DeFi incidents, on-chain evaluation, blockchain benchmark


The pith

The best agent configuration detects 37.5% of real smart contract vulnerabilities, exploits 43.7% of cases, and patches 23.4%, with the top agent netting $57.4M in simulated exploit profit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CyberChainBench, a new evaluation set built from 541 historical DeFi exploit incidents across 9 EVM chains. It tests LLM agents on three linked tasks using historical blockchain state in isolated mainnet forks: spotting the flaw, writing an exploit that extracts value, and producing a patch that stops the attack while preserving normal transactions. Results show a clear ordering of difficulty, with exploitation outperforming detection and patching lagging furthest behind. The top agent realizes $57.4M in total exploit profit across the 200-case exploit set at $2.39 per case. A reader would care because these contracts manage real funds and the numbers quantify how far current agents are from reliable automated defense.

Core claim

The paper establishes CyberChainBench as an end-to-end on-chain benchmark drawn from 541 real incidents and evaluates multiple agent-model pairs across vulnerability detection, exploit generation, and patch synthesis. The strongest configuration reaches 37.5% on detection, 43.7% on exploitation, and 23.4% on patching; the leading agent (Codex with GPT-5.5) realizes $57.4M in total exploit profit across the 200-case exploit set at $2.39 per case. Exploits are scored by economic impact on historical forks, and patches are checked by replaying both historical attacks and legitimate transactions on a proxy-upgradeable subset.
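The profit-based scoring described here can be illustrated with a minimal sketch. It assumes each run is summarized by the attacker-controlled value before and after the exploit on the fork; the `ExploitRun` fields, the USD denomination, and the `min_profit_usd` threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExploitRun:
    """One agent attempt on one benchmark case (field names are hypothetical)."""
    case_id: str
    balance_before_usd: float  # attacker-controlled value before the exploit runs
    balance_after_usd: float   # attacker-controlled value after it runs on the fork

    @property
    def profit_usd(self) -> float:
        return self.balance_after_usd - self.balance_before_usd


def grade_exploits(runs, min_profit_usd=0.0):
    """Return (success_rate, total_profit_usd) over all runs.

    A run counts as a success only if its profit strictly exceeds
    min_profit_usd; this threshold is an assumption, not the paper's rule.
    """
    if not runs:
        return 0.0, 0.0
    wins = [r for r in runs if r.profit_usd > min_profit_usd]
    return len(wins) / len(runs), sum(r.profit_usd for r in wins)
```

Under this scheme an exploit that compiles and executes but leaves the attacker no richer scores zero, which is what separates economic grading from a simple "did the transaction succeed" check.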

What carries the argument

CyberChainBench with its Harbor-orchestrated mainnet-fork environments, five-type vulnerability taxonomy, and structured ground truth that ties each case to a specific block, vulnerability location, and attacker profit.
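To make the idea of structured ground truth concrete, here is a minimal sketch of what one case record could look like. The paper states only that each case carries a block anchor, a vulnerability type, a localization, and the attacker profit; every field name and the JSON layout below are assumptions for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundTruth:
    """Hypothetical per-case record; not the benchmark's actual schema."""
    case_id: str
    chain: str
    fork_block: int            # block the evaluation fork is anchored to
    vulnerability_type: str    # one label from the five-type taxonomy
    vulnerable_contract: str   # localization: contract address
    vulnerable_function: str   # localization: function name
    attacker_profit_usd: float


def load_case(raw: str) -> GroundTruth:
    """Parse one JSON record and reject obviously invalid values."""
    case = GroundTruth(**json.loads(raw))  # TypeError on missing or extra keys
    if case.fork_block < 0:
        raise ValueError("fork_block must be non-negative")
    if case.attacker_profit_usd < 0:
        raise ValueError("attacker_profit_usd must be non-negative")
    return case
```

Keeping the block number in the record is what lets every agent run start from the same pre-attack state.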

If this is right

Current agents succeed more often at generating profitable exploits than at localizing vulnerabilities or synthesizing patches.
One agent configuration accumulates $57.4M in simulated exploit value across the 200-case exploit set at $2.39 per case.
Patch synthesis remains the hardest of the three tasks even when agents have access to the same tools and historical state.
Economic-impact grading on real forks provides a concrete, money-denominated metric that distinguishes working exploits from theoretical ones.
Replay of both attack and legitimate transactions supplies an automatic oracle for validating whether a patch actually stops the vulnerability.
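The replay oracle in the last point can be sketched as a three-part check. This is an illustrative reading of "fail-to-pass", not the benchmark's harness: `replay` is a hypothetical callable that returns True when a transaction executes successfully against a given contract version.

```python
def validate_patch(replay, original, patched, attack_txs, legit_txs):
    """Accept a patch only if it blocks the attack without breaking normal use.

    replay(contract, tx) -> bool is a hypothetical hook that returns True
    when the transaction executes successfully against that contract.
    """
    # The attack must reproduce on the unpatched contract, otherwise the
    # oracle has nothing to show was fixed.
    attack_reproduces = all(replay(original, tx) for tx in attack_txs)
    # After patching, no attack transaction may succeed.
    attack_blocked = not any(replay(patched, tx) for tx in attack_txs)
    # Legitimate historical transactions must still succeed.
    legit_preserved = all(replay(patched, tx) for tx in legit_txs)
    return attack_reproduces and attack_blocked and legit_preserved
```

The third condition matters as much as the second: a patch that simply reverts every call would stop the attack but fail the legitimate-transaction replay.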

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the measured gap persists, agent tooling may currently be more immediately useful for red-team analysis than for automated blue-team patching.
Improving patch performance would likely require new mechanisms for reasoning about state invariants and upgrade safety rather than simply scaling existing detection or exploitation pipelines.
The benchmark's per-case cost figures suggest that offensive use of agents could become economically attractive at scale if success rates rise only modestly.
Extending the evaluation to non-EVM chains or to incidents after the current data cutoff would test whether the observed difficulty ordering generalizes.
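The cost observation above can be checked with simple arithmetic on the reported figures. The sketch assumes the $2.39 average applies to every case in the 200-case exploit set and that the $57.4M total covers the same runs; the function and key names are illustrative.

```python
CASES = 200                  # size of the exploit set reported in the abstract
COST_PER_CASE_USD = 2.39     # reported average cost per case
TOTAL_PROFIT_USD = 57.4e6    # reported total simulated exploit profit


def summarize(cases, cost_per_case_usd, total_profit_usd):
    """Derive totals and ratios from the three headline numbers."""
    total_cost = cases * cost_per_case_usd
    return {
        "total_cost_usd": total_cost,
        "profit_per_case_usd": total_profit_usd / cases,
        "profit_per_dollar": total_profit_usd / total_cost,
    }


summary = summarize(CASES, COST_PER_CASE_USD, TOTAL_PROFIT_USD)
```

Under these assumptions the whole exploit set costs about $478 to run, and the average simulated profit is $287,000 per case. The profit is computed on historical forks, so it measures what the exploit would have extracted then and is not revenue available to anyone today.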

Load-bearing premise

The 541 incidents and the 200-case subset are an unbiased sample of real vulnerabilities and the isolated fork environments reproduce historical on-chain behavior without artifacts or leakage.

What would settle it

Running the identical agent configurations on a fresh collection of post-cutoff incidents and observing materially lower success rates or profit totals than the reported figures.
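One way to run that comparison is to split per-case outcomes by incident date relative to a model's training cutoff. The sketch below is a hedged illustration: the cutoff date is a hypothetical input (it may not be public for closed models), and the `(incident_date, succeeded)` pair format is an assumption.

```python
from datetime import date


def split_success_rates(results, cutoff):
    """Compare success rates on incidents before and after a cutoff date.

    results: iterable of (incident_date, succeeded) pairs.
    cutoff:  assumed training cutoff; incidents on or before it count as
             potentially seen during pretraining.
    Returns a dict of rates, with None for an empty bucket.
    """
    buckets = {"pre_cutoff": [], "post_cutoff": []}
    for incident_date, succeeded in results:
        key = "pre_cutoff" if incident_date <= cutoff else "post_cutoff"
        buckets[key].append(bool(succeeded))
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```

A large drop from the pre-cutoff to the post-cutoff rate would be consistent with memorization, though it could also reflect newer incidents being harder, so the split is evidence and not proof.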

Original abstract

We present CyberChainBench, a benchmark for evaluating LLM-based agents on smart contract security across three complementary tasks: vulnerability detection, exploit generation, and patch synthesis. Built from 541 real-world exploit incidents from DeFiHackLabs spanning 9 EVM chains, the benchmark provides end-to-end on-chain evaluation where agents interact with historical blockchain state through isolated evaluation environments orchestrated by Harbor, using tools to read code, trace transactions, and validate exploits on mainnet forks. Each case is anchored to a specific block and includes structured ground truth covering vulnerability type, localization, and attacker profit. Exploits are graded by economic impact on historical forks; patches are validated by replaying historical attacks and legitimate transactions as fail-to-pass test oracles on a proxy-upgradeable subset. We define a five-type vulnerability taxonomy and evaluate multiple agent-model configurations. Results reveal a clear difficulty gradient: the best configuration scores 37.5% on detection, 43.7% on exploitation, but only 23.4% on patching, with the top agent (Codex with GPT-5.5) realizing $57.4M in total exploit profit across the 200-case exploit set at a cost of $2.39 per case.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CyberChainBench brings real historical incidents into an end-to-end AI agent benchmark for smart contract security, but the exploitation results are likely inflated by public data leakage into model training.


The main thing to know is that CyberChainBench is a new benchmark using real DeFi hack incidents for testing AI agents on vulnerability detection, exploit generation, and patching, with on-chain evaluation, but the results are probably compromised by data leakage from public sources.

The paper does a few things right. It pulls from 541 actual incidents across nine EVM chains and sets up end-to-end tasks with economic impact grading on historical forks using Harbor environments. The reported numbers show a sensible progression—37.5% detection, 43.7% exploitation, 23.4% patching—and the $57.4M total profit figure gives a tangible sense of scale. Grounding everything to specific blocks and using real transaction traces is better than most synthetic setups.

The soft spots are noticeable. The abstract lacks detail on how the 200-case subset was picked from the 541 or how the ground truth was verified. More critically, the contamination concern holds weight. DeFiHackLabs is public, with code, traces, and profit numbers all online. Models trained on web data could easily recall these specifics rather than reason from the state provided in the tool interface. The evaluation setup does not appear to test for or block that, so the exploitation success rate and profit numbers may not reflect genuine agent capability on unseen vulnerabilities.

This paper is aimed at people building or evaluating AI tools for smart contract security. It would be useful for anyone tracking progress in that area or designing similar benchmarks. The thinking seems straightforward and engaged with the domain, without obvious internal contradictions.

I would send it to peer review. The benchmark concept is worth referee scrutiny, but the authors need to tackle the leakage risk and fill in the missing method details before it can be taken as solid evidence.

Referee Report

3 major / 1 minor

Summary. The paper introduces CyberChainBench, a benchmark for LLM-based agents on smart contract security across vulnerability detection, exploit generation, and patch synthesis. It is constructed from 541 real-world incidents in the public DeFiHackLabs repository spanning 9 EVM chains, with agents interacting via tools in isolated Harbor environments on historical mainnet forks. Exploits are scored by realized economic impact on historical states, and patches are validated via replay oracles on a proxy-upgradeable subset. The best agent-model configuration reports 37.5% detection, 43.7% exploitation, and 23.4% patching, with the top agent (Codex + GPT-5.5) achieving $57.4M aggregate exploit profit across the 200-case subset at $2.39 per case.

Significance. If the empirical results hold after addressing selection and leakage concerns, the benchmark would be a valuable contribution to the field by supplying an economically grounded, on-chain evaluation of agent capabilities against documented real-world vulnerabilities. The difficulty gradient across the three tasks and the use of profit-based grading on historical forks provide concrete, falsifiable metrics that go beyond synthetic test suites.

major comments (3)

[Abstract] The manuscript states that a 200-case subset is used for exploitation evaluation but provides no information on selection criteria from the 541 incidents, stratification by chain/vulnerability type, or statistical power. This directly affects the reliability of the 43.7% exploitation rate and $57.4M profit aggregate.
[Abstract] The benchmark relies exclusively on publicly documented DeFiHackLabs incidents (including code, traces, and profit figures). No discussion or mitigation is provided for the risk that evaluated models (including GPT-5.5) may have encountered these incidents during pretraining, which would allow success via memorization rather than analysis of the supplied state in the Harbor environments. This is load-bearing for the validity of the exploitation and profit results.
[Abstract] Ground-truth construction for vulnerability type, localization, and attacker profit is asserted but not described (e.g., manual annotation protocol, inter-annotator agreement, or use of on-chain transaction data). Without this, it is impossible to assess whether the reported scores measure the claimed capabilities.
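The statistical-power concern in the first comment can be made concrete with a confidence interval. The sketch below uses the standard Wilson score interval and assumes, for illustration only, that the 43.7% exploitation rate was measured over 200 independent cases; the manuscript as summarized here does not confirm that denominator.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(0.437, 200)
```

With these inputs the interval spans roughly 0.37 to 0.51, wide enough that the reported gap between exploitation (43.7%) and detection (37.5%) would need a paired or per-case analysis before it can be called significant.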

minor comments (1)

[Abstract] The abstract should clarify whether the $2.39 per-case cost includes only model API usage or also the computational overhead of the Harbor mainnet-fork environments.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback on subset selection, data leakage risks, and ground-truth construction. We address each point below and indicate planned revisions where appropriate.


Referee: [Abstract] The manuscript states that a 200-case subset is used for exploitation evaluation but provides no information on selection criteria from the 541 incidents, stratification by chain/vulnerability type, or statistical power. This directly affects the reliability of the 43.7% exploitation rate and $57.4M profit aggregate.

Authors: We agree the abstract omits these details. The full manuscript (Section 4.1) states the 200-case subset was drawn from the 541 incidents to manage evaluation cost while preserving diversity, but does not specify the exact sampling procedure, stratification, or power analysis. We will revise the abstract, Section 4, and add an appendix with the selection criteria (including chain and vulnerability-type stratification) plus a brief statistical power discussion.
revision: yes

Referee: [Abstract] The benchmark relies exclusively on publicly documented DeFiHackLabs incidents (including code, traces, and profit figures). No discussion or mitigation is provided for the risk that evaluated models (including GPT-5.5) may have encountered these incidents during pretraining, which would allow success via memorization rather than analysis of the supplied state in the Harbor environments. This is load-bearing for the validity of the exploitation and profit results.

Authors: We acknowledge this is a substantive concern not addressed in the current manuscript. The evaluation uses historical mainnet forks and requires agents to interact with live contract state via tools, which in principle demands context-specific reasoning rather than pure recall. However, we cannot verify or eliminate pretraining exposure for closed models. We will add an explicit limitations paragraph discussing the memorization risk and will note that future iterations could incorporate post-cutoff models or private incidents where feasible.
revision: partial

Referee: [Abstract] Ground-truth construction for vulnerability type, localization, and attacker profit is asserted but not described (e.g., manual annotation protocol, inter-annotator agreement, or use of on-chain transaction data). Without this, it is impossible to assess whether the reported scores measure the claimed capabilities.

Authors: The ground truth is taken directly from the structured annotations already present in the public DeFiHackLabs repository, which itself derives vulnerability type, location, and profit figures from the original on-chain transactions and post-mortem reports. The manuscript does not describe the extraction or validation process. We will expand Section 3 to include a concise description of how these fields were mapped from the repository and any manual verification steps performed by the authors.
revision: yes

Standing simulated objections (not resolved)

Complete verification or elimination of pretraining leakage for proprietary models, as the authors lack access to training data or cutoff dates.

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark on external incidents


The paper constructs a benchmark from 541 external DeFiHackLabs incidents and evaluates agent performance via direct measurements on historical blockchain states in isolated Harbor environments. No equations, predictions, or claims reduce by construction to fitted parameters or self-referential definitions. Results (detection 37.5%, exploitation 43.7%, patching 23.4%) are reported as empirical outcomes with ground-truth anchors, not derived quantities. No self-citation load-bearing steps or ansatzes appear in the provided text. The derivation chain is self-contained against the external incident data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the external DeFiHackLabs dataset accurately captures real vulnerabilities and that the Harbor-based fork evaluation faithfully reproduces historical behavior; no free parameters are fitted to produce the reported scores.

axioms (1)

Domain assumption: Historical EVM blockchain states can be forked and replayed accurately enough for exploit validation and patch testing.
Invoked for the on-chain evaluation environments and economic impact grading described in the abstract.
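As an illustration of what forking historical state at a fixed block involves, the sketch below builds a command line for Foundry's `anvil` node using its `--fork-url`, `--fork-block-number`, and `--port` options. Foundry appears in the paper's reference list, but the text here does not say which tool the Harbor environments use to create forks, so treating `anvil` as the fork backend is an assumption; the RPC URL in the usage note is a placeholder.

```python
def anvil_fork_command(rpc_url, block_number, port=8545):
    """Build (but do not run) an anvil command that forks a chain at one block.

    rpc_url must point at an archive-capable RPC endpoint, since historical
    state at block_number has to be readable. Using anvil here is an
    assumption about the tooling, not a description of the benchmark.
    """
    if block_number < 0:
        raise ValueError("block_number must be non-negative")
    return [
        "anvil",
        "--fork-url", rpc_url,
        "--fork-block-number", str(block_number),
        "--port", str(port),
    ]
```

For example, `anvil_fork_command("https://rpc.example.invalid", 1000)` yields an argument list that could be handed to `subprocess.run`. Pinning the block number is what makes two runs of the same case see identical balances and contract code, which the replay-based grading depends on.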



Reference graph

Works this paper leans on

31 extracted references · 1 canonical work pages

[1] Ethervm: Ethereum virtual machine bytecode decompiler. https://ethervm.io/, 2026. Online decompiler, accessed May 2026.

[2] Hayden Adams, Noah Zinsmeister, Moody Salem, River Keefer, and Dan Robinson. Uniswap v3 core. Uniswap Whitepaper, 2021.

[3] Alchemy. Alchemy: The web3 development platform, 2024. URL https://www.alchemy.com/

[4] Anthropic. Model context protocol. https://modelcontextprotocol.io/, 2024. Open standard for connecting LLMs with external tools and data sources.

[5] BNB Chain. BNB Smart Chain whitepaper. https://github.com/bnb-chain/whitepaper, 2020.

[6] Longfei Chen, Ruibin Yan, Taiyu Wong, Yiyang Chen, Jialai Wang, and Chao Zhang. Smartpoc: Generating executable and validated pocs for smart contract bug reports. arXiv preprint arXiv:2511.12993, 2025.

[7] Etherscan. Etherscan: The Ethereum blockchain explorer. https://etherscan.io/, 2024. Accessed 2026.

[8] Josselin Feist, Gustavo Grieco, and Alex Groce. Slither: A static analysis framework for smart contracts. Proceedings of the 2nd International Workshop on Emerging Trends in Software Engineering for Blockchain (WETSEB), 2019.

[9] Arthur Gervais and Liyi Zhou. AI agent smart contract exploit generation. arXiv preprint arXiv:2507.05558, 2025.

[10] Gustavo Grieco, Will Song, Artur Cygan, Josselin Feist, and Alex Groce. Echidna: Effective, usable, and fast fuzzing for smart contracts. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2020.

[11] Harbor Framework. Harbor: Sandboxed agent evaluation framework. https://github.com/harbor-framework/harbor, 2025. Docker-based evaluation orchestration for LLM agents, accessed 2026.

[12] Harry A. Kalodner, Steven Goldfeder, Xiaoqi Chen, S. Matthew Weinberg, and Edward W. Felten. Arbitrum: Scalable, private smart contracts. Proceedings of the 27th USENIX Security Symposium, 2018.

[13] Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, An...

[14] Bernhard Mueller. Smashing Ethereum smart contracts for fun and real profit. In HITB Security Conference, 2018.

[15] Santiago Palladino. EIP-1967: Proxy storage slots. https://eips.ethereum.org/EIPS/eip-1967, 2019. Ethereum Improvement Proposal.

[16] PancakeSwap. Pancakeswap: Decentralized exchange on BNB chain. https://pancakeswap.finance, 2021. Accessed 2026.

[17] Paradigm. Foundry: A blazing fast, portable and modular toolkit for Ethereum application development. https://github.com/foundry-rs/foundry, 2024. Accessed 2026.

[18] Polygon Labs. Polygon: Ethereum’s internet of blockchains. https://polygon.technology/papers/pol-whitepaper, 2021.

[19] Izaiah Sun, Daniel Tan, and Andy Deng. LISA technical report: An agentic framework for smart contract auditing. arXiv preprint arXiv:2509.24698, 2025.

[20] Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Haijun Wang, Zhengzi Xu, Xiaofei Xie, and Yang Liu. GPTScan: Detecting logic vulnerabilities in smart contracts by combining GPT with program analysis. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE), 2024.

[21] SunWeb3Sec. Defihacklabs. https://github.com/SunWeb3Sec/DeFiHackLabs, 2026. GitHub repository, accessed March 7, 2026.

[22] SushiSwap. Sushiswap: Community-governed DEX. https://sushi.com, 2020. Accessed 2026.

[23] Fabian Vogelsteller and Vitalik Buterin. EIP-20: Token standard. https://eips.ethereum.org/EIPS/eip-20, 2015. Ethereum Improvement Proposal.

[24] Justin Wang, Andreas Bigger, Xiaohai Xu, Justin W. Lin, Andy Applebaum, Tejal Patwardhan, Alpin Yukseloglu, and Olivia Watkins. EVMbench: Evaluating AI agents on smart contract security. arXiv preprint arXiv:2603.04915, 2026.

[25] Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, and Dawn Song. CyberGym: Evaluating AI agents’ real-world cybersecurity capabilities at scale. arXiv preprint arXiv:2506.02548, 2025.

[26] Zhun Wang, Nico Schiller, Hongwei Li, Srijiith Sesha Narayana, Milad Nasr, Nicholas Carlini, Xiangyu Qi, Eric Wallace, Elie Bursztein, Luca Invernizzi, Kurt Thomas, Yan Shoshitaishvili, Wenbo Guo, Jingxuan He, Thorsten Holz, and Dawn Song. Exploitgym: Can ai agents turn security vulnerabilities into real attacks?, 2026. URL https://arxiv.org/abs/2605.11086

[27] Gavin Wood. Ethereum: A secure decentralised generalised transaction ledger. Ethereum Project Yellow Paper, 2014.

[28] Shihao Xia, Mengting He, Linhai Song, and Yiying Zhang. Sc-bench: A large-scale dataset for smart contract auditing. In 2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code), pages 57–64, 2025. doi: 10.1109/LLM4Code66737.2025.00012

[29] Winnie Xiao, Cole Killian, Henry Sleight, Alan Chan, Nicholas Carlini, and Alwin Peng. SCONE-bench: Smart contracts exploitation benchmark. https://red.anthropic.com/2025/smart-contracts/, 2025. Anthropic Red Team, accessed March 7, 2026.

[30] Andy K Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Yifeng Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Khiem Tran, Nishka Kacheria, Ethan Jun shen Ho, Denis Liu, Lauren McLane, Olivia Beyer Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, J...

[31] Liyi Zhou, Xihan Xiong, Jens Ernstberger, Stefanos Chaliasos, Zhipeng Wang, Ye Wang, Kaihua Qin, Roger Wattenhofer, Dawn Song, and Arthur Gervais. SoK: Decentralized finance (DeFi) attacks. In Proceedings of the 44th IEEE Symposium on Security and Privacy (S&P), 2023.

A. Vulnerability Taxonomy Definitions. Table 6 defines the five vulnerability types wit...
