
arxiv: 2504.12034 · v1 · pith:ZOWGIEPDnew · submitted 2025-04-16 · 💻 cs.SE · cs.CR

OpDiffer: LLM-Assisted Opcode-Level Differential Testing of Ethereum Virtual Machine

Pith reviewed 2026-05-22 20:31 UTC · model grok-4.3

classification 💻 cs.SE cs.CR
keywords EVM security · differential testing · LLM-assisted testing · Ethereum Virtual Machine · smart contract bugs · opcode testing · bug detection · blockchain security

The pith

OpDiffer uses LLMs to generate opcode test cases that expose 26 new bugs across nine Ethereum Virtual Machines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OpDiffer, a framework that generates test inputs for the Ethereum Virtual Machine at the opcode level with the help of large language models and then runs those inputs on multiple EVM implementations to find differences. It addresses two main gaps in earlier work: test cases that often had invalid semantics or lacked variety, and the lack of automatic ways to spot bugs and trace their causes. The central idea is that comparing how different EVMs respond to the same input can reliably surface implementation errors that affect smart contract behavior or network stability. Evaluation across nine EVMs produced 26 previously unknown bugs, most confirmed by their developers, along with large gains in code coverage over existing methods. The authors also estimate that over seven percent of real Ethereum contracts could encounter conditions that trigger these bugs.

Core claim

OpDiffer is a differential testing framework that combines LLMs with static analysis to produce semantically valid opcode sequences and to detect and localize bugs automatically by observing inconsistent execution results across distinct EVM implementations. The framework was applied to nine EVMs and identified 26 previously unknown bugs, of which 22 were confirmed by developers and three received CNVD identifiers. The same evaluation showed coverage improvements of up to 71.06 percent, 148.40 percent, and 655.56 percent relative to three prior baselines, and analysis of deployed contracts indicated that 7.21 percent could trigger the discovered bugs under certain environmental settings.

What carries the argument

OpDiffer, the differential testing framework that uses LLMs to synthesize opcode-level inputs and static analysis to identify root causes of behavioral divergences between EVM implementations.
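
The comparison step at the heart of differential testing can be sketched in a few lines. This is a minimal illustration, not OpDiffer's implementation: the `ExecResult` format, the `Evm` adapter type, and the stub EVMs `evm_a`, `evm_b`, and `evm_c` are all hypothetical names introduced here. The sketch runs one bytecode input on every EVM and groups the EVMs by identical normalized result; more than one group signals a divergence that still needs triage.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ExecResult:
    """Normalized outcome of one run (hypothetical format, not the paper's)."""
    status: str      # e.g. "success", "stack_underflow", "out_of_gas"
    gas_used: int
    output: bytes    # return data, empty if none


# Hypothetical adapter type: each EVM is wrapped as bytecode -> ExecResult.
Evm = Callable[[bytes], ExecResult]


def group_by_result(evms: Dict[str, Evm], bytecode: bytes) -> Dict[ExecResult, List[str]]:
    """Run the same bytecode on every EVM and group EVM names by identical result."""
    groups: Dict[ExecResult, List[str]] = defaultdict(list)
    for name, run in sorted(evms.items()):
        groups[run(bytecode)].append(name)
    return dict(groups)


def is_divergent(groups: Dict[ExecResult, List[str]]) -> bool:
    """A divergence exists when the EVMs do not all land in one group."""
    return len(groups) > 1


# Stub EVMs for illustration only; real adapters would invoke each client.
def evm_a(code: bytes) -> ExecResult:
    return ExecResult("success", 3, b"")


def evm_b(code: bytes) -> ExecResult:
    return ExecResult("success", 3, b"")


def evm_c(code: bytes) -> ExecResult:
    return ExecResult("out_of_gas", 0, b"")


groups = group_by_result(
    {"evm_a": evm_a, "evm_b": evm_b, "evm_c": evm_c},
    bytes.fromhex("1a"),  # 0x1a is the BYTE opcode
)
for result, names in groups.items():
    print(result.status, names)
print("divergent:", is_divergent(groups))
```

Grouping rather than voting keeps the sketch neutral about which EVM is correct: a divergence is only a candidate bug until someone checks it against the specification.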

If this is right

  • Developers of individual EVMs receive concrete, reproducible test cases that expose implementation errors before deployment.
  • Higher code coverage during testing increases the chance of catching security problems that could cause inconsistent smart-contract outcomes.
  • An estimated 7.21 percent of deployed contracts may encounter triggering conditions for the identified bugs under certain environmental settings.
  • Routine use of the approach would reduce the risk of denial-of-service or unexpected behavior propagating through the Ethereum network.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same LLM-driven input generation could be applied to virtual machines of other blockchains that share similar opcode structures.
  • Combining these differential results with on-chain monitoring might allow early detection of contracts that would hit the bugs in practice.
  • If the generated tests can be turned into a public regression suite, they would provide ongoing protection as new EVM versions are released.

Load-bearing premise

Behavioral differences observed when running LLM-generated opcode sequences on different EVMs reliably signal real bugs instead of valid implementation choices or false positives.

What would settle it

A follow-up review in which developers reject most of the reported differences as non-bugs, or in which the same inputs produce no observable failures when replayed on a live Ethereum node, would show the method over-reports issues.

Figures

Figures reproduced from arXiv: 2504.12034 by Haoyu Wang, Jie Ma, Jinwen Xi, Mingzhe Xing, Ningyu He, Ying Gao, Yinliang Yue.

Figure 1: A motivating example about the bug of the …
Figure 2: An example for illustrating a semantically-invalid case.
Figure 3: The workflow overview of OpDiffer.
Figure 4: Prompt for generating seed generator of the …
Figure 5: A generated seed generator for the BYTE opcode.
Figure 6: The control flow of BYTE in EVM, where the specification and line numbers in … Panel (a): without considering ICFG.
Figure 7: Prompt for implementing control-flow-oriented mutation for opcode …
Figure 8: Coverage results of OpDiffer, EVMFuzzer, NeoDiff and FuzzyVM on Geth and SealEVM. Bars represent the achieved code coverage, while lines represent the coverage improvement by which OpDiffer exceeds EVMFuzzer, NeoDiff and FuzzyVM.
Figure 9: Case study #1: the buggy implementation of …
Figure 10: Case study #2: the buggy implementation of …
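
Figures 5 and 6 concern the BYTE opcode and its control flow, with branches for stack underflow, running out of gas, the `index < 32` test, and stack overflow. The sketch below illustrates those branches from the publicly documented BYTE semantics (pop an index and a 256-bit word, push the indexed byte counted from the most significant end, gas cost 3). It is neither the paper's listing nor any client's code; the function name `exec_byte`, the status strings, and the order of the checks are assumptions made for illustration.

```python
from typing import List, Tuple

WORD_MASK = (1 << 256) - 1   # EVM words are 256 bits wide
STACK_LIMIT = 1024           # maximum EVM stack depth
BYTE_GAS = 3                 # gas cost of BYTE


def exec_byte(stack: List[int], gas_left: int) -> Tuple[str, List[int], int]:
    """Run BYTE on a copy of `stack` (top of stack = end of list).

    Returns (status, resulting stack, remaining gas). Check order is an
    assumption of this sketch and may differ between real EVMs.
    """
    if len(stack) < 2:
        return "stack_underflow", stack, gas_left
    if gas_left < BYTE_GAS:
        return "out_of_gas", stack, gas_left

    new_stack = list(stack)
    index = new_stack.pop()               # top item: byte index
    word = new_stack.pop() & WORD_MASK    # second item: 256-bit word

    if index < 32:
        # Byte 0 is the most significant byte of the word.
        result = (word >> (8 * (31 - index))) & 0xFF
    else:
        result = 0

    # Generic push check; it cannot trigger here because two items were
    # just popped, and is kept only to mirror the overflow branch.
    if len(new_stack) >= STACK_LIMIT:
        return "stack_overflow", stack, gas_left

    new_stack.append(result)
    return "success", new_stack, gas_left - BYTE_GAS


print(exec_byte([0xAB << 248, 0], 10))   # most significant byte
print(exec_byte([0xFF, 31], 10))         # least significant byte
print(exec_byte([0xFF, 32], 10))         # index out of range pushes 0
print(exec_byte([0xFF], 10))             # too few operands
print(exec_byte([0xFF, 0], 2))           # not enough gas
```

Each call exercises one branch, which is the kind of path-level variety that control-flow-oriented mutation aims for.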
Original abstract

As Ethereum continues to thrive, the Ethereum Virtual Machine (EVM) has become the cornerstone powering tens of millions of active smart contracts. Intuitively, security issues in EVMs could lead to inconsistent behaviors among smart contracts or even denial-of-service of the entire blockchain network. However, to the best of our knowledge, only a limited number of studies focus on the security of EVMs. Moreover, they suffer from 1) insufficient test input diversity and invalid semantics; and 2) the inability to automatically identify bugs and locate root causes. To bridge this gap, we propose OpDiffer, a differential testing framework for EVM, which takes advantage of LLMs and static analysis methods to address the above two limitations. We conducted the largest-scale evaluation, covering nine EVMs and uncovering 26 previously unknown bugs, 22 of which have been confirmed by developers and three have been assigned CNVD IDs. Compared to state-of-the-art baselines, OpDiffer can improve code coverage by at most 71.06%, 148.40% and 655.56%, respectively. Through an analysis of real-world deployed Ethereum contracts, we estimate that 7.21% of the contracts could trigger our identified EVM bugs under certain environmental settings, potentially resulting in severe negative impact on the Ethereum ecosystem.
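
Coverage improvements above 100 percent can only be relative gains over a baseline, since coverage itself cannot exceed 100 percent. The sketch below shows that arithmetic, assuming the usual definition (ours minus baseline, divided by baseline). The function name and the sample numbers are hypothetical and are not taken from the paper.

```python
def relative_improvement(ours: float, baseline: float) -> float:
    """Relative gain of `ours` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline coverage must be positive")
    return (ours - baseline) / baseline * 100.0


# Hypothetical coverage values (percent of code covered), for illustration only.
print(relative_improvement(50.0, 20.0))
```

Under this definition a small baseline makes a moderate absolute gain look very large, so the relative figures are easier to interpret next to the absolute coverage values.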

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces OpDiffer, an LLM-assisted differential testing framework for Ethereum Virtual Machines that combines LLM-generated opcode sequences with static analysis to produce semantically valid inputs and automatically identify behavioral divergences across EVM implementations. It reports the largest-scale evaluation on nine EVMs, uncovering 26 previously unknown bugs (22 confirmed by developers, three assigned CNVD IDs), substantial coverage gains over baselines (up to 71.06%, 148.40%, 655.56%), and an estimate that 7.21% of real-world contracts could trigger the identified bugs.

Significance. If the bug identifications prove sound and the coverage claims reproducible, the work would advance automated security testing for EVMs by addressing input diversity and root-cause localization limitations of prior studies. The scale across nine implementations and the real-world impact estimate add practical value to the Ethereum ecosystem.

major comments (3)
  1. [Abstract] The abstract states the central claims (26 bugs and specific coverage improvements) but gives no details on validation procedures, how false positives were ruled out, or how coverage was measured (e.g., tool, metric, or baseline configurations), so the empirical results cannot be assessed.
  2. [Method] Method and evaluation description: The differential testing procedure treats any observable behavioral divergence on LLM-generated inputs as a bug, but provides no independent oracle or reference semantics (e.g., Yellow Paper model) to distinguish errors from allowed implementation variations or spec ambiguities; developer confirmation occurs post-hoc and does not establish that inputs lie in the defined behavior space.
  3. [Evaluation] Evaluation results: The reported coverage improvements (71.06%, 148.40%, 655.56%) and bug counts lack any description of the measurement methodology or controls for input validity, undermining the comparison to state-of-the-art baselines and the claim of largest-scale evaluation.
minor comments (1)
  1. [Abstract] The abstract mentions nine EVMs and real-world contract analysis but does not name the specific EVM implementations or the contract dataset used, which would aid immediate understanding.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below, providing clarifications and indicating planned revisions where appropriate to improve the presentation of our results.

  1. Referee: [Abstract] The abstract states the central claims (26 bugs and specific coverage improvements) but gives no details on validation procedures, how false positives were ruled out, or how coverage was measured (e.g., tool, metric, or baseline configurations), so the empirical results cannot be assessed.

    Authors: We agree that the abstract is concise and omits key methodological details. In the revised manuscript, we will expand the abstract with brief statements noting that the 26 bugs were validated via developer confirmations (22 cases) and CNVD assignments (3 cases), and that coverage gains were measured using standard profiling tools with explicit baseline configurations as described in Section 5. Full validation procedures and measurement details remain in Sections 4 and 5. This change will make the central claims more assessable while preserving abstract length. revision: yes

  2. Referee: [Method] Method and evaluation description: The differential testing procedure treats any observable behavioral divergence on LLM-generated inputs as a bug, but provides no independent oracle or reference semantics (e.g., Yellow Paper model) to distinguish errors from allowed implementation variations or spec ambiguities; developer confirmation occurs post-hoc and does not establish that inputs lie in the defined behavior space.

    Authors: We respectfully note that our approach follows standard differential testing practice for systems like the EVM, where the Yellow Paper specification is informal and contains known ambiguities. Divergences are not automatically labeled as bugs; they undergo manual triage and are only reported after developer confirmation, which serves as domain-expert validation that the input triggers unintended behavior. We will add a clarifying paragraph in the Method section explaining this rationale, the role of static analysis in ensuring semantic validity, and why a formal reference oracle is not feasible or necessary here. This addresses the concern on substance without requiring changes to the core methodology. revision: partial

  3. Referee: [Evaluation] Evaluation results: The reported coverage improvements (71.06%, 148.40%, 655.56%) and bug counts lack any description of the measurement methodology or controls for input validity, undermining the comparison to state-of-the-art baselines and the claim of largest-scale evaluation.

    Authors: We acknowledge that additional explicit details would strengthen reproducibility. In the revised Evaluation section, we will add a dedicated subsection describing the coverage measurement tools and metrics employed for each of the nine EVMs, the exact baseline configurations used for comparison, and the input validity controls provided by the static analysis component. These additions will directly support the reported coverage figures and the largest-scale evaluation claim. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical tool evaluation with no derivations or fitted predictions

The paper describes an empirical differential testing framework (OpDiffer) that uses LLMs and static analysis to generate opcode sequences, runs them across nine EVM implementations, and reports observed behavioral differences as bugs (with post-hoc developer confirmation). No equations, parameters, uniqueness theorems, or first-principles derivations appear in the provided text. The central claims rest on experimental outcomes rather than on any reduction of a 'prediction' to an input quantity defined by the authors. This is a standard self-contained experimental report; the skeptic's concern about whether differences constitute bugs is a question of external validity, not of circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical software-testing study; the central claims rest on experimental outcomes rather than mathematical axioms, free parameters, or newly postulated entities.

