CyberChainBench: Can AI Agents Secure Smart Contracts Against Real-World On-Chain Vulnerabilities?
Pith reviewed 2026-06-26 01:44 UTC · model grok-4.3
The pith
AI agents detect 37.5% of real smart contract vulnerabilities, generate exploits in 43.7% of cases, and patch 23.4%, with top runs netting $57.4M in simulated profit.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes CyberChainBench as an end-to-end on-chain benchmark drawn from 541 real incidents and evaluates multiple agent-model pairs across vulnerability detection, exploit generation, and patch synthesis. The strongest configuration reaches 37.5% on detection, 43.7% on exploitation, and 23.4% on patching; on the 200-case exploitation subset, the leading agent (Codex with GPT-5.5) produces $57.4M in total historical profit at an average cost of $2.39 per case. Exploits are scored by actual economic impact on mainnet forks, and patches are checked by replaying both attack and legitimate transactions on a proxy-upgradeable subset of contracts.
What carries the argument
CyberChainBench with its Harbor-orchestrated mainnet-fork environments, five-type vulnerability taxonomy, and structured ground truth that ties each case to a specific block, vulnerability location, and attacker profit.
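To make the "structured ground truth" concrete, the following is a minimal sketch of what one case record and a profit-based score could look like. Every field name, the example location format, and the clamped profit-ratio rule are assumptions made for illustration; the summary does not give the paper's actual schema or grading formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundTruth:
    """Hypothetical per-case record; field names are illustrative only."""
    case_id: str                 # hypothetical identifier
    chain: str                   # one of the 9 EVM chains
    fork_block: int              # block the case is anchored to
    vuln_type: str               # label from the five-type taxonomy
    vuln_location: str           # e.g. "Contract.function" (format assumed)
    attacker_profit_usd: float   # historical attacker profit


def exploit_score(realized_profit_usd: float, truth: GroundTruth) -> float:
    """Assumed rule: share of historical profit realized, clamped to [0, 1]."""
    if truth.attacker_profit_usd <= 0:
        return 0.0
    ratio = realized_profit_usd / truth.attacker_profit_usd
    return max(0.0, min(1.0, ratio))
```

A ratio-style score is only one way to turn economic impact into a grade; a benchmark could equally use a pass/fail profit threshold.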
If this is right
- Current agents succeed more often at generating profitable exploits than at localizing vulnerabilities or synthesizing patches.
- One agent configuration can accumulate tens of millions in simulated exploit value across hundreds of cases at under three dollars per attempt.
- Patch synthesis remains the hardest of the three tasks even when agents have access to the same tools and historical state.
- Economic-impact grading on real forks provides a concrete, money-denominated metric that distinguishes working exploits from theoretical ones.
- Replay of both attack and legitimate transactions supplies an automatic oracle for validating whether a patch actually stops the vulnerability.
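The replay oracle in the last point can be sketched as a fail-to-pass check: a patch is accepted only if every historical attack stops working and every legitimate transaction still succeeds. The sketch below models each replay as a plain callable, which is an assumption for illustration; the paper's harness runs these replays against forked chain state and its interface is not described in this summary.

```python
from typing import Callable, Iterable

# Hypothetical model: a replay returns True if the transaction succeeds
# against the patched contract and False if it reverts.
Replay = Callable[[], bool]


def patch_passes(attack_replays: Iterable[Replay],
                 legit_replays: Iterable[Replay]) -> bool:
    """Accept a patch only if attacks are blocked and legitimate use survives."""
    attacks_blocked = all(not replay() for replay in attack_replays)
    legit_preserved = all(replay() for replay in legit_replays)
    return attacks_blocked and legit_preserved
```

Note that a patch which simply disables the contract would block the attack but fail the legitimate replays, which is why both sets are needed.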
Where Pith is reading between the lines
- If the measured gap persists, agent tooling may currently be more immediately useful for red-team analysis than for automated blue-team patching.
- Improving patch performance would likely require new mechanisms for reasoning about state invariants and upgrade safety rather than simply scaling existing detection or exploitation pipelines.
- The benchmark's per-case cost figures suggest that offensive use of agents could become economically attractive at scale if success rates rise only modestly.
- Extending the evaluation to non-EVM chains or to incidents after the current data cutoff would test whether the observed difficulty ordering generalizes.
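The economics behind the cost observation can be checked with back-of-envelope arithmetic on the reported figures. The sketch assumes that $2.39 is a mean over all 200 exploit cases and that it covers the full per-case spend; the referee report notes that the second point is not stated in the abstract. The profit is simulated on historical forks, so the ratio is not a realizable return.

```python
# Figures reported in the abstract for the top agent on the exploit set.
CASES = 200
COST_PER_CASE_USD = 2.39      # assumed to be a mean over all 200 cases
TOTAL_PROFIT_USD = 57.4e6     # simulated historical profit

total_cost_usd = CASES * COST_PER_CASE_USD               # about $478
mean_profit_per_case_usd = TOTAL_PROFIT_USD / CASES      # about $287,000
profit_to_cost_ratio = TOTAL_PROFIT_USD / total_cost_usd # about 120,000 to 1
```

Under these assumptions the whole exploit run costs under $500, which is why even modest gains in success rate matter for the offensive-use argument.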
Load-bearing premise
The 541 incidents and the 200-case subset are an unbiased sample of real vulnerabilities and the isolated fork environments reproduce historical on-chain behavior without artifacts or leakage.
What would settle it
Running the identical agent configurations on a fresh collection of post-cutoff incidents and observing materially lower success rates or profit totals than the reported figures.
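One way to approximate this test with existing results is to split cases by incident date around a model's training cutoff and compare success rates on each side. The sketch below is illustrative only: the `CaseResult` record and the cutoff date are hypothetical, and the simulated rebuttal states that the authors do not have cutoff dates for the proprietary models.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical per-case outcome for one agent configuration."""
    incident_date: date
    exploited: bool


def success_rate(results: List[CaseResult]) -> float:
    """Share of cases exploited; NaN when there are no cases."""
    if not results:
        return float("nan")
    return sum(r.exploited for r in results) / len(results)


def split_by_cutoff(results: List[CaseResult], cutoff: date) -> Tuple[float, float]:
    """Return (rate on or before cutoff, rate after cutoff)."""
    before = [r for r in results if r.incident_date <= cutoff]
    after = [r for r in results if r.incident_date > cutoff]
    return success_rate(before), success_rate(after)
```

A markedly lower post-cutoff rate would be consistent with memorization, though it could also reflect newer incidents being harder.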
Original abstract
We present CyberChainBench, a benchmark for evaluating LLM-based agents on smart contract security across three complementary tasks: vulnerability detection, exploit generation, and patch synthesis. Built from 541 real-world exploit incidents from DeFiHackLabs spanning 9 EVM chains, the benchmark provides end-to-end on-chain evaluation where agents interact with historical blockchain state through isolated evaluation environments orchestrated by Harbor, using tools to read code, trace transactions, and validate exploits on mainnet forks. Each case is anchored to a specific block and includes structured ground truth covering vulnerability type, localization, and attacker profit. Exploits are graded by economic impact on historical forks; patches are validated by replaying historical attacks and legitimate transactions as fail-to-pass test oracles on a proxy-upgradeable subset. We define a five-type vulnerability taxonomy and evaluate multiple agent-model configurations. Results reveal a clear difficulty gradient: the best configuration scores 37.5% on detection, 43.7% on exploitation, but only 23.4% on patching, with the top agent (Codex with GPT-5.5) realizing $57.4M in total exploit profit across the 200-case exploit set at a cost of $2.39 per case.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CyberChainBench, a benchmark for LLM-based agents on smart contract security across vulnerability detection, exploit generation, and patch synthesis. It is constructed from 541 real-world incidents in the public DeFiHackLabs repository spanning 9 EVM chains, with agents interacting via tools in isolated Harbor environments on historical mainnet forks. Exploits are scored by realized economic impact on historical states, and patches are validated via replay oracles on a proxy-upgradeable subset. The best agent-model configuration reports 37.5% detection, 43.7% exploitation, and 23.4% patching, with the top agent (Codex + GPT-5.5) achieving $57.4M aggregate exploit profit across the 200-case subset at $2.39 per case.
Significance. If the empirical results hold after addressing selection and leakage concerns, the benchmark would be a valuable contribution to the field by supplying an economically grounded, on-chain evaluation of agent capabilities against documented real-world vulnerabilities. The difficulty gradient across the three tasks and the use of profit-based grading on historical forks provide concrete, falsifiable metrics that go beyond synthetic test suites.
major comments (3)
- [Abstract] The manuscript states that a 200-case subset is used for exploitation evaluation but provides no information on selection criteria from the 541 incidents, stratification by chain/vulnerability type, or statistical power. This directly affects the reliability of the 43.7% exploitation rate and $57.4M profit aggregate.
- [Abstract] The benchmark relies exclusively on publicly documented DeFiHackLabs incidents (including code, traces, and profit figures). No discussion or mitigation is provided for the risk that evaluated models (including GPT-5.5) may have encountered these incidents during pretraining, which would allow success via memorization rather than analysis of the supplied state in the Harbor environments. This is load-bearing for the validity of the exploitation and profit results.
- [Abstract] Ground-truth construction for vulnerability type, localization, and attacker profit is asserted but not described (e.g., manual annotation protocol, inter-annotator agreement, or use of on-chain transaction data). Without this, it is impossible to assess whether the reported scores measure the claimed capabilities.
minor comments (1)
- [Abstract] The abstract should clarify whether the $2.39 per-case cost includes only model API usage or also the computational overhead of the Harbor mainnet-fork environments.
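The statistical-power concern in the first major comment can be illustrated with a confidence interval for the exploitation rate. The sketch below computes a 95% Wilson score interval under the assumption that 43.7% is a simple success proportion over 200 independent cases. The paper may instead report a graded score, in which case a binomial interval does not apply directly.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Reported exploitation rate on the 200-case subset (treated as a proportion).
low, high = wilson_interval(0.437, 200)
```

Under that assumption the interval runs from roughly 37% to 51%, so the exploitation figure is not cleanly separated from the 37.5% detection figure at this sample size.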
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on subset selection, data leakage risks, and ground-truth construction. We address each point below and indicate planned revisions where appropriate.
Point-by-point responses
- Referee: [Abstract] The manuscript states that a 200-case subset is used for exploitation evaluation but provides no information on selection criteria from the 541 incidents, stratification by chain/vulnerability type, or statistical power. This directly affects the reliability of the 43.7% exploitation rate and $57.4M profit aggregate.
  Authors: We agree the abstract omits these details. The full manuscript (Section 4.1) states the 200-case subset was drawn from the 541 incidents to manage evaluation cost while preserving diversity, but does not specify the exact sampling procedure, stratification, or power analysis. We will revise the abstract and Section 4, and add an appendix with the selection criteria (including chain and vulnerability-type stratification) plus a brief statistical power discussion.
  revision: yes
- Referee: [Abstract] The benchmark relies exclusively on publicly documented DeFiHackLabs incidents (including code, traces, and profit figures). No discussion or mitigation is provided for the risk that evaluated models (including GPT-5.5) may have encountered these incidents during pretraining, which would allow success via memorization rather than analysis of the supplied state in the Harbor environments. This is load-bearing for the validity of the exploitation and profit results.
  Authors: We acknowledge this is a substantive concern not addressed in the current manuscript. The evaluation uses historical mainnet forks and requires agents to interact with live contract state via tools, which in principle demands context-specific reasoning rather than pure recall. However, we cannot verify or eliminate pretraining exposure for closed models. We will add an explicit limitations paragraph discussing the memorization risk and will note that future iterations could incorporate post-cutoff models or private incidents where feasible.
  revision: partial
- Referee: [Abstract] Ground-truth construction for vulnerability type, localization, and attacker profit is asserted but not described (e.g., manual annotation protocol, inter-annotator agreement, or use of on-chain transaction data). Without this, it is impossible to assess whether the reported scores measure the claimed capabilities.
  Authors: The ground truth is taken directly from the structured annotations already present in the public DeFiHackLabs repository, which itself derives vulnerability type, location, and profit figures from the original on-chain transactions and post-mortem reports. The manuscript does not describe the extraction or validation process. We will expand Section 3 to include a concise description of how these fields were mapped from the repository and any manual verification steps performed by the authors.
  revision: yes
- Complete verification or elimination of pretraining leakage for proprietary models, as the authors lack access to training data or cutoff dates.
Circularity Check
No circularity: direct empirical benchmark on external incidents
Full rationale
The paper constructs a benchmark from 541 external DeFiHackLabs incidents and evaluates agent performance via direct measurements on historical blockchain states in isolated Harbor environments. No equations, predictions, or claims reduce by construction to fitted parameters or self-referential definitions. Results (detection 37.5%, exploitation 43.7%, patching 23.4%) are reported as empirical outcomes with ground-truth anchors, not derived quantities. No load-bearing self-citations or ansatzes appear in the provided text. The evaluation chain is self-contained relative to the external incident data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Historical EVM blockchain states can be forked and replayed accurately enough for exploit validation and patch testing.