
arxiv: 2606.26216 · v1 · pith:2CSWJVHAnew · submitted 2026-06-24 · 💻 cs.CR · cs.AI

CyberChainBench: Can AI Agents Secure Smart Contracts Against Real-World On-Chain Vulnerabilities?

Pith reviewed 2026-06-26 01:44 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords smart contract security · LLM agents · vulnerability detection · exploit generation · patch synthesis · DeFi incidents · on-chain evaluation · blockchain benchmark

The pith

The best agent configuration scores 37.5% on detecting real smart contract vulnerabilities, 43.7% on exploiting them, and 23.4% on patching them, with the top agent netting $57.4M in simulated exploit profit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CyberChainBench, a new evaluation set built from 541 historical DeFi exploit incidents across 9 EVM chains. It tests LLM agents on three linked tasks using historical blockchain state in isolated mainnet forks: spotting the flaw, writing an exploit that extracts value, and producing a patch that stops the attack while preserving normal transactions. Results show a clear ordering of difficulty: exploitation (43.7%) outscores detection (37.5%), and patching (23.4%) lags furthest behind. The top agent earns $57.4M in total exploit profit across the 200-case exploit set at a cost of $2.39 per case. A reader would care because these contracts manage real funds and the numbers quantify how far current agents are from reliable automated defense.

Core claim

The paper establishes CyberChainBench as an end-to-end on-chain benchmark drawn from 541 real incidents and evaluates multiple agent-model pairs across vulnerability detection, exploit generation, and patch synthesis. The strongest configuration reaches 37.5% on detection, 43.7% on exploitation, and 23.4% on patching; the leading agent (Codex with GPT-5.5) realizes $57.4M in total exploit profit across the 200-case exploit set at a cost of $2.39 per case. Exploits are scored by economic impact on historical mainnet forks, and patches are checked by replaying both historical attacks and legitimate transactions on a proxy-upgradeable subset.
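As a rough sanity check on the headline cost figures, the sketch below derives the implied total spend and profit ratios from the three numbers quoted above. It assumes the $2.39 figure applies as an average over all 200 cases; what that cost covers is not stated here, and the function and key names are illustrative only.

```python
# Figures quoted in the summary above.
TOTAL_PROFIT_USD = 57.4e6   # total exploit profit reported for the top agent
CASES = 200                 # size of the exploit set
COST_PER_CASE_USD = 2.39    # reported cost per case (assumed to be an average)


def summarize(total_profit: float, cases: int, cost_per_case: float) -> dict:
    """Derive aggregate cost and simple ratios from the reported figures."""
    total_cost = cases * cost_per_case
    return {
        "total_cost_usd": total_cost,
        "mean_profit_per_case_usd": total_profit / cases,
        "profit_per_dollar": total_profit / total_cost,
    }


if __name__ == "__main__":
    stats = summarize(TOTAL_PROFIT_USD, CASES, COST_PER_CASE_USD)
    for name, value in stats.items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the total spend is about $478, the mean profit is $287,000 per case, and each dollar of cost corresponds to roughly $120,000 of simulated profit. The mean says nothing about how profit is distributed across cases, so a few large incidents could account for most of the total.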

What carries the argument

CyberChainBench with its Harbor-orchestrated mainnet-fork environments, five-type vulnerability taxonomy, and structured ground truth that ties each case to a specific block, vulnerability location, and attacker profit.

If this is right

  • Current agents succeed more often at generating profitable exploits than at localizing vulnerabilities or synthesizing patches.
  • One agent configuration can accumulate tens of millions of dollars in simulated exploit value across the 200-case exploit set at under three dollars per case.
  • Patch synthesis remains the hardest of the three tasks even when agents have access to the same tools and historical state.
  • Economic-impact grading on real forks provides a concrete, money-denominated metric that distinguishes working exploits from theoretical ones.
  • Replay of both attack and legitimate transactions supplies an automatic oracle for validating whether a patch actually stops the vulnerability.
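The replay oracle in the last bullet can be sketched as a simple predicate: a patch passes only if every historical attack no longer succeeds and every legitimate transaction still does. This is a minimal sketch, not the paper's harness. The `Tx` type, the `replay` callback, and the idea that "success" is a boolean are all assumptions made for illustration; a real harness would replay transactions against a patched contract on a fork.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Tx:
    """Hypothetical record of one historical transaction."""
    label: str
    is_attack: bool


# Hypothetical callback: returns True if the transaction succeeds
# when replayed against the patched contract.
Replay = Callable[[Tx], bool]


def patch_passes(replay: Replay, txs: Iterable[Tx]) -> bool:
    """Accept a patch only if attacks fail and legitimate transactions succeed."""
    txs = list(txs)
    attacks = [t for t in txs if t.is_attack]
    legit = [t for t in txs if not t.is_attack]
    if not attacks:
        raise ValueError("at least one attack transaction is required")
    attacks_blocked = all(not replay(t) for t in attacks)
    legit_preserved = all(replay(t) for t in legit)
    return attacks_blocked and legit_preserved


# Toy replays standing in for three kinds of patch.
HISTORY = [Tx("attack", True), Tx("deposit", False), Tx("withdraw", False)]


def targeted_patch(tx: Tx) -> bool:
    return not tx.is_attack      # blocks only the attack


def noop_patch(tx: Tx) -> bool:
    return True                  # everything still succeeds, attack included


def lockdown_patch(tx: Tx) -> bool:
    return False                 # blocks everything, breaking normal use


if __name__ == "__main__":
    for patch in (targeted_patch, noop_patch, lockdown_patch):
        print(patch.__name__, patch_passes(patch, HISTORY))
```

Checking both directions is what rules out the two trivial "patches": one that changes nothing and one that disables the contract entirely.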

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the measured gap persists, agent tooling may currently be more immediately useful for red-team analysis than for automated blue-team patching.
  • Improving patch performance would likely require new mechanisms for reasoning about state invariants and upgrade safety rather than simply scaling existing detection or exploitation pipelines.
  • The benchmark's per-case cost figures suggest that offensive use of agents could become economically attractive at scale if success rates rise only modestly.
  • Extending the evaluation to non-EVM chains or to incidents after the current data cutoff would test whether the observed difficulty ordering generalizes.

Load-bearing premise

The 541 incidents and the 200-case subset are an unbiased sample of real vulnerabilities and the isolated fork environments reproduce historical on-chain behavior without artifacts or leakage.

What would settle it

Running the identical agent configurations on a fresh collection of post-cutoff incidents and observing materially lower success rates or profit totals than the reported figures.

Original abstract

We present CyberChainBench, a benchmark for evaluating LLM-based agents on smart contract security across three complementary tasks: vulnerability detection, exploit generation, and patch synthesis. Built from 541 real-world exploit incidents from DeFiHackLabs spanning 9 EVM chains, the benchmark provides end-to-end on-chain evaluation where agents interact with historical blockchain state through isolated evaluation environments orchestrated by Harbor, using tools to read code, trace transactions, and validate exploits on mainnet forks. Each case is anchored to a specific block and includes structured ground truth covering vulnerability type, localization, and attacker profit. Exploits are graded by economic impact on historical forks; patches are validated by replaying historical attacks and legitimate transactions as fail-to-pass test oracles on a proxy-upgradeable subset. We define a five-type vulnerability taxonomy and evaluate multiple agent-model configurations. Results reveal a clear difficulty gradient: the best configuration scores 37.5% on detection, 43.7% on exploitation, but only 23.4% on patching, with the top agent (Codex with GPT-5.5) realizing $57.4M in total exploit profit across the 200-case exploit set at a cost of $2.39 per case.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces CyberChainBench, a benchmark for LLM-based agents on smart contract security across vulnerability detection, exploit generation, and patch synthesis. It is constructed from 541 real-world incidents in the public DeFiHackLabs repository spanning 9 EVM chains, with agents interacting via tools in isolated Harbor environments on historical mainnet forks. Exploits are scored by realized economic impact on historical states, and patches are validated via replay oracles on a proxy-upgradeable subset. The best agent-model configuration reports 37.5% detection, 43.7% exploitation, and 23.4% patching, with the top agent (Codex + GPT-5.5) achieving $57.4M aggregate exploit profit across the 200-case subset at $2.39 per case.

Significance. If the empirical results hold after addressing selection and leakage concerns, the benchmark would be a valuable contribution to the field by supplying an economically grounded, on-chain evaluation of agent capabilities against documented real-world vulnerabilities. The difficulty gradient across the three tasks and the use of profit-based grading on historical forks provide concrete, falsifiable metrics that go beyond synthetic test suites.

major comments (3)
  1. [Abstract] The manuscript states that a 200-case subset is used for exploitation evaluation but provides no information on selection criteria from the 541 incidents, stratification by chain/vulnerability type, or statistical power. This directly affects the reliability of the 43.7% exploitation rate and $57.4M profit aggregate.
  2. [Abstract] The benchmark relies exclusively on publicly documented DeFiHackLabs incidents (including code, traces, and profit figures). No discussion or mitigation is provided for the risk that evaluated models (including GPT-5.5) may have encountered these incidents during pretraining, which would allow success via memorization rather than analysis of the supplied state in the Harbor environments. This is load-bearing for the validity of the exploitation and profit results.
  3. [Abstract] Ground-truth construction for vulnerability type, localization, and attacker profit is asserted but not described (e.g., manual annotation protocol, inter-annotator agreement, or use of on-chain transaction data). Without this, it is impossible to assess whether the reported scores measure the claimed capabilities.
minor comments (1)
  1. [Abstract] The abstract should clarify whether the $2.39 per-case cost includes only model API usage or also the computational overhead of the Harbor mainnet-fork environments.
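To make the statistical-power concern in major comment 1 concrete, the sketch below computes a 95% Wilson score interval for a proportion observed on 200 cases. It assumes the 43.7% exploitation score behaves like a binomial success rate over 200 independent cases. That is an approximation the source does not confirm: 43.7% of 200 is 87.4, not a whole number, so the score may be averaged over runs or include partial credit.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    low, high = wilson_interval(0.437, 200)
    print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 37% to 51%, wide enough that the reported gap between exploitation and detection would need a paired comparison on the same cases to be interpreted with confidence.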

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback on subset selection, data leakage risks, and ground-truth construction. We address each point below and indicate planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] The manuscript states that a 200-case subset is used for exploitation evaluation but provides no information on selection criteria from the 541 incidents, stratification by chain/vulnerability type, or statistical power. This directly affects the reliability of the 43.7% exploitation rate and $57.4M profit aggregate.

    Authors: We agree the abstract omits these details. The full manuscript (Section 4.1) states the 200-case subset was drawn from the 541 incidents to manage evaluation cost while preserving diversity, but does not specify the exact sampling procedure, stratification, or power analysis. We will revise the abstract, Section 4, and add an appendix with the selection criteria (including chain and vulnerability-type stratification) plus a brief statistical power discussion. revision: yes

  2. Referee: [Abstract] The benchmark relies exclusively on publicly documented DeFiHackLabs incidents (including code, traces, and profit figures). No discussion or mitigation is provided for the risk that evaluated models (including GPT-5.5) may have encountered these incidents during pretraining, which would allow success via memorization rather than analysis of the supplied state in the Harbor environments. This is load-bearing for the validity of the exploitation and profit results.

    Authors: We acknowledge this is a substantive concern not addressed in the current manuscript. The evaluation uses historical mainnet forks and requires agents to interact with the forked contract state via tools, which in principle demands context-specific reasoning rather than pure recall. However, we cannot verify or eliminate pretraining exposure for closed models. We will add an explicit limitations paragraph discussing the memorization risk and will note that future iterations could incorporate post-cutoff models or private incidents where feasible. revision: partial

  3. Referee: [Abstract] Ground-truth construction for vulnerability type, localization, and attacker profit is asserted but not described (e.g., manual annotation protocol, inter-annotator agreement, or use of on-chain transaction data). Without this, it is impossible to assess whether the reported scores measure the claimed capabilities.

    Authors: The ground truth is taken directly from the structured annotations already present in the public DeFiHackLabs repository, which itself derives vulnerability type, location, and profit figures from the original on-chain transactions and post-mortem reports. The manuscript does not describe the extraction or validation process. We will expand Section 3 to include a concise description of how these fields were mapped from the repository and any manual verification steps performed by the authors. revision: yes

standing simulated objections not resolved
  • Complete verification or elimination of pretraining leakage for proprietary models, as the authors lack access to training data or cutoff dates.
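The stratified selection promised in response 1 could take a form like the sketch below, which draws a fixed-size sample with per-stratum quotas proportional to stratum size (largest-remainder rounding). This is one possible procedure under stated assumptions, not the authors' method: the case records, the `chain` key, and the stratum names and counts in the demo are synthetic placeholders.

```python
import random
from collections import Counter, defaultdict
from typing import Callable, Hashable, Sequence


def stratified_sample(cases: Sequence[dict], key: Callable[[dict], Hashable],
                      k: int, seed: int = 0) -> list[dict]:
    """Sample k cases with quotas proportional to each stratum's size."""
    n = len(cases)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and the number of cases")
    strata = defaultdict(list)
    for case in cases:
        strata[key(case)].append(case)

    # Integer arithmetic: floor quota plus remainder for each stratum.
    quota, remainder = {}, {}
    for name, members in strata.items():
        quota[name], remainder[name] = divmod(k * len(members), n)

    # Hand leftover slots to the strata with the largest remainders.
    leftover = k - sum(quota.values())
    for name in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        quota[name] += 1

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, quota[name]))
    return sample


def synthetic_cases() -> list[dict]:
    """Placeholder data: 541 cases over three made-up strata."""
    sizes = {"chain_a": 300, "chain_b": 150, "chain_c": 91}
    return [{"id": f"{name}-{i}", "chain": name}
            for name, size in sizes.items() for i in range(size)]


if __name__ == "__main__":
    subset = stratified_sample(synthetic_cases(), lambda c: c["chain"], k=200)
    print(Counter(c["chain"] for c in subset))
```

Publishing the stratification key, the quotas, and the seed alongside the subset would let readers reproduce the selection and check it against the full 541-incident pool.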

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark on external incidents

Full rationale

The paper constructs a benchmark from 541 external DeFiHackLabs incidents and evaluates agent performance via direct measurements on historical blockchain states in isolated Harbor environments. No equations, predictions, or claims reduce by construction to fitted parameters or self-referential definitions. Results (detection 37.5%, exploitation 43.7%, patching 23.4%) are reported as empirical outcomes with ground-truth anchors, not derived quantities. No self-citation load-bearing steps or ansatzes appear in the provided text. The derivation chain is self-contained against the external incident data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the external DeFiHackLabs dataset accurately captures real vulnerabilities and that the Harbor-based fork evaluation faithfully reproduces historical behavior; no free parameters are fitted to produce the reported scores.

axioms (1)
  • domain assumption: Historical EVM blockchain states can be forked and replayed accurately enough for exploit validation and patch testing.
    Invoked for the on-chain evaluation environments and economic impact grading described in the abstract.


