
arxiv: 2601.06565 · v6 · submitted 2026-01-10 · 💻 cs.CL

EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation

Pith reviewed 2026-05-16 15:10 UTC · model grok-4.3

classification 💻 cs.CL
keywords EVM · benchmark · LLM evaluation · transaction code generation · natural language to code · blockchain safety · smart contract generation · dynamic evaluation

The pith

EVM-QuestBench shows language models handle single EVM actions better than full multi-step transaction workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EVM-QuestBench, an execution-grounded benchmark for turning natural language instructions into correct transaction scripts on EVM-compatible blockchains. Tasks are drawn from templates with randomized parameters, then the generated code runs on a forked chain where validators check whether the actual outcomes match the expected results. Evaluation across 20 models reveals large gaps, with models proving more reliable at isolated steps than at completing entire sequences. This matters because even small mistakes in on-chain transactions can produce permanent loss of funds.

Core claim

We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains. The benchmark employs dynamic evaluation: instructions are sampled from template pools, numeric parameters are drawn from predefined intervals, and validators verify outcomes against these instantiated values. EVM-QuestBench contains 107 tasks (62 atomic, 45 composite). Its modular architecture enables rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation; composite tasks apply step-efficiency decay. We evaluate 20 models and find large performance gaps, with split scores revealing persistent asymmetry between single-action precision and multi-step workflow completion.
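The dynamic-evaluation loop in this claim (sample a phrasing, draw a parameter from an interval, validate against the drawn value) can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the `TaskTemplate` shape, the `{amount}` placeholder syntax, the seeded generator, and the `instantiate` and `validate` helpers are all hypothetical names introduced here.

```typescript
// Hypothetical task template; none of these field names come from the paper.
interface TaskTemplate {
  instructions: string[];         // pool of phrasings containing "{amount}"
  amountRange: [number, number];  // inclusive integer interval for the parameter
}

interface TaskInstance {
  instruction: string;
  amount: number;                 // the instantiated value the validator checks
}

// Small seeded generator (mulberry32-style) so a run can be replayed from its seed.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Picks one phrasing and one parameter value, both driven by the seed.
function instantiate(template: TaskTemplate, seed: number): TaskInstance {
  const rand = seededRandom(seed);
  const [lo, hi] = template.amountRange;
  const index = Math.floor(rand() * template.instructions.length);
  const amount = lo + Math.floor(rand() * (hi - lo + 1));
  const instruction = template.instructions[index].replace("{amount}", String(amount));
  return { instruction, amount };
}

// The validator compares the observed outcome with the instantiated value,
// not with a constant baked into the task definition.
function validate(instance: TaskInstance, observedAmount: number): boolean {
  return observedAmount === instance.amount;
}
```

Because the expected value is drawn at instantiation time, a model cannot pass by memorizing a fixed answer; recording the seed is what keeps a run repeatable.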

What carries the argument

EVM-QuestBench benchmark with dynamic sampling from task templates, parameter randomization, execution on a forked EVM chain, and outcome validation to measure both correctness and safety of generated transaction code.
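The abstract states that the runner uses snapshot isolation on the forked chain, but this summary does not say how. One common way to get that behaviour on local development nodes such as Hardhat Network or Anvil is the `evm_snapshot` and `evm_revert` JSON-RPC methods. The sketch below assumes that mechanism; `RpcSend` and `withSnapshot` are hypothetical names, and the paper's runner may work differently.

```typescript
// Minimal JSON-RPC sender; in practice this would wrap a provider for the local fork.
type RpcSend = (method: string, params: unknown[]) => Promise<unknown>;

// Runs `task` against the fork, then rolls the chain back to its pre-task state,
// so that one task's transactions cannot leak into the next task.
async function withSnapshot<T>(send: RpcSend, task: () => Promise<T>): Promise<T> {
  const snapshotId = await send("evm_snapshot", []);
  try {
    return await task();
  } finally {
    // Revert even when the generated script throws.
    await send("evm_revert", [snapshotId]);
  }
}
```

Putting the revert in `finally` matters for a benchmark: a script that fails halfway may already have changed balances or allowances, and the next task should not inherit that state.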

If this is right

  • Models show better results on atomic tasks than on composite ones, pointing to a specific weakness in handling sequential dependencies.
  • Execution-based checking on a live forked chain catches errors that static code analysis would miss.
  • The template-driven, modular design makes it straightforward to add new task types without rebuilding the entire system.
  • Step-efficiency decay in scoring penalizes solutions that use unnecessary intermediate steps.
  • Large performance gaps across 20 models indicate that current training regimes do not yet produce reliable on-chain automation.
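The step-efficiency decay mentioned above is not given as a formula in this summary; the only related definition available is the step overhead ∆ = Kact − Kopt from Figure 10. As an illustration only, the sketch below assumes a geometric decay in that overhead with a hypothetical rate `gamma = 0.9`. Both the functional form and the rate are assumptions, not the paper's scoring rule.

```typescript
// Hypothetical decay: rawScore * gamma^max(0, kAct - kOpt).
// kAct = steps the model actually used, kOpt = optimal step count for the task.
function decayedScore(rawScore: number, kAct: number, kOpt: number, gamma = 0.9): number {
  if (kAct <= 0 || kOpt <= 0) throw new RangeError("step counts must be positive");
  const overhead = Math.max(0, kAct - kOpt); // step overhead, floored at zero
  return rawScore * Math.pow(gamma, overhead);
}
```

Under these assumptions a workflow with an optimal length of 3 steps that is completed in 5 keeps 81% of its raw score, while finishing in 3 steps or fewer incurs no penalty.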

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Better scores on this benchmark could translate into fewer irreversible losses when users rely on language models to construct DeFi transactions.
  • The single-action versus multi-step asymmetry suggests training data should emphasize long-range dependencies across transaction steps.
  • The same dynamic evaluation pattern could be applied to other blockchain environments to test cross-platform reliability.
  • Deploying such models in production would still require additional runtime safeguards beyond benchmark performance.

Load-bearing premise

The 107 tasks sampled from templates and validated on a forked chain are assumed to represent the diversity and safety requirements of real-world EVM transaction scenarios.

What would settle it

If models that score highly on EVM-QuestBench still produce scripts that fail or lose funds when executed on mainnet with real assets, the benchmark's claim to measure safe transaction generation would be refuted.

Figures

Figures reproduced from arXiv: 2601.06565 by Eric Yang, Ke Wang, Lynn Ai, Pei Yang, Tianyu Shi, Wanyi Chen.

Figure 1: EVM-QuestBench evaluation architecture and end-to-end pipeline. A natural language instruction is sampled from a template pool with dynamic numeric parameters, passed to the LLM for TypeScript script generation, executed on a snapshot-isolated forked chain, and scored by task-specific validators against post-state constraints. Composite tasks additionally apply step-efficiency decay.
Figure 2: End-to-end evaluation pipeline of EVM-QuestBench. A natural language instruction is instantiated from a template with dynamically sampled parameters, fed to the LLM for TypeScript script generation, executed by the runner on a snapshot-isolated forked chain, and finally scored by task-specific validators against post-execution on-chain state.
Figure 3: Total score (avg@5) versus total token usage (single run, 107 tasks). Colour encodes API cost (USD, with prompt caching). Models in the upper-left quadrant achieve high scores with low token budgets.
… Flash at $0.29), consuming under 420K tokens. Thinking-enabled models are substantially more expensive (up to $14.16 for Gemini-3-Pro) due to chain-of-thought token overhead, yet do not consistently outperfor…
Figure 4: Task split in EVM-QuestBench: 62 atomic tasks and 45 composite tasks (107 total).
Figure 5: Distribution of optimal steps (Kopt) for composite tasks. Most workflows require 3 steps (53.3%), with a mean of 3.3. Axes: Optimal Steps (Kopt) versus Number of Tasks.
C Reproducibility: This appendix summarizes the setup required to reproduce EVM-QuestBench runs and the experimental settings that affect run-to-run variance. For strict reproducibility, record the fork block height, model sampling parameters, and the task parameter random seed. C.1 Environm…
Figure 6: Atomic score versus Composite score (avg@5). Each point is a model.
Figure 7: Composite workflow difficulty by pattern. Pass is defined as score ≥ 60.
D.1 Runner Interface: Each model outputs a TypeScript module exporting an entry function. The runner provides providerUrl, the agent EOA address, and a contract address map for the local fork.
export async function executeSkill(providerUrl: string, agentAddress: string, deployedContracts: Record<string, string>): Promise<Record<strin…
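A module that satisfies the runner interface quoted with Figure 7 might look like the following sketch. The return type is truncated in the source, so `Record<string, string>` is an assumption, as are the `recipient` key in the contract map and the fixed transfer amount. The sketch avoids any library import so that it stays self-contained; a real script would presumably use ethers.js, which the source mentions in its failure list.

```typescript
// Sketch only: the return type is truncated in the source; Record<string, string> is assumed.
export async function executeSkill(
  providerUrl: string,
  agentAddress: string,
  deployedContracts: Record<string, string>
): Promise<Record<string, string>> {
  void providerUrl; // unused here: this example needs no on-chain reads

  // "recipient" is a hypothetical key; real tasks would name their own contracts.
  const to = deployedContracts["recipient"];
  if (!to) throw new Error("no recipient address supplied for this fork");

  // Fixed illustrative amount: 0.1 of an 18-decimal asset, i.e. 0.1 × 10^18 = 10^17 wei.
  const amountWei = 10n ** 17n;

  // The script only builds the request; per the source, the runner signs and submits it.
  return { from: agentAddress, to, value: amountWei.toString(), data: "0x" };
}
```

Returning a plain request object rather than sending the transaction keeps signing and submission in the runner, which is how the source's worked example describes the execution step.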
Figure 8: Model ranking by total score (avg@5, 20 models).
Figure 9: Pass rate by composite workflow pattern for the ten most frequent patterns.
Profile      Count  Interpretation
High-High    7      strong precision and workflows
High-Low     3      strong precision, weaker workflows
Low-High     3      weaker precision, stronger workflows
Low-Low      3      weaker on both splits
Atomic-Only  4      composite split near zero
Figure 10: Distribution of step overhead ∆ = Kact − Kopt.
Model              Atomic  Composite  AtomicAvg  CompAvg  Total   Quadrant
Claude-Sonnet-4.5  62/62   45/45      69.0       88.0     8235.8  High-High
Gemini-3-Pro       62/62   45/45      69.4       79.1     7863.0  High-High
GPT-5              62/62   45/45      66.6       81.0     7773.7  High-High
GPT-5.1            62/62   45/45      59.9       80.4     7331.7  High-High
Kimi-K2-Thinking   62/62   45/45      58.1       80.7     7238.3  High-High
Gemini-2.5-Flash   62/62   45/45      52.7       83.7     7033.4  Hi…
Figure 11: Pass rate by atomic subcategory.
Original abstract

Large language models are increasingly applied to various development scenarios. However, in on-chain transaction scenarios, even a minor error can cause irreversible loss for users. Existing evaluations often overlook execution accuracy and safety. We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains. The benchmark employs dynamic evaluation: instructions are sampled from template pools, numeric parameters are drawn from predefined intervals, and validators verify outcomes against these instantiated values. EVM-QuestBench contains 107 tasks (62 atomic, 45 composite). Its modular architecture enables rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation; composite tasks apply step-efficiency decay. We evaluate 20 models and find large performance gaps, with split scores revealing persistent asymmetry between single-action precision and multi-step workflow completion. Code: https://anonymous.4open.science/r/bsc_quest_bench-A9CF/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces EVM-QuestBench, an execution-grounded benchmark for natural-language generation of EVM transaction scripts. It consists of 107 tasks (62 atomic, 45 composite) constructed via template sampling with numeric parameters drawn from intervals, validated through execution on a forked EVM chain using snapshot isolation and outcome validators. The modular design supports rapid task addition, and composite tasks incorporate step-efficiency decay. Evaluation of 20 models reveals large performance gaps and a persistent asymmetry between single-action precision and multi-step workflow completion.

Significance. If the task set proves representative, the benchmark supplies a reproducible, execution-verified framework that exposes concrete limitations in current LLMs for safety-critical blockchain scripting. The dynamic evaluation protocol and public code release are concrete strengths that could accelerate progress on reliable on-chain code generation.

major comments (2)
  1. [§3] §3 (Benchmark Construction): The assertion that the 107 template-derived tasks (62 atomic + 45 composite) constitute a sufficient proxy for real-world EVM transaction diversity rests on an unverified assumption; no quantitative coverage analysis against mainnet transaction traces, state-dependency statistics, or adversarial patterns (e.g., reentrancy, multi-contract call graphs) is provided, which is load-bearing for the generalizability of the reported performance gaps and single-action vs. multi-step asymmetry.
  2. [§5] §5 (Evaluation): The headline finding of persistent asymmetry is presented via split scores, yet the manuscript supplies no error analysis, per-task-type breakdown, or ablation on validator logic, leaving open the possibility that the observed gap is an artifact of the narrow task distribution rather than a robust model property.
minor comments (2)
  1. [§4] The description of the snapshot-isolated runner and step-efficiency decay mechanism would benefit from a small pseudocode listing or explicit formula for the decay function to improve reproducibility.
  2. [§5] Table or figure captions for the 20-model results should explicitly state the exact metric definitions (e.g., success rate, efficiency-adjusted score) rather than relying solely on prose.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the insightful comments. We respond to each major comment below and indicate planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): The assertion that the 107 template-derived tasks (62 atomic + 45 composite) constitute a sufficient proxy for real-world EVM transaction diversity rests on an unverified assumption; no quantitative coverage analysis against mainnet transaction traces, state-dependency statistics, or adversarial patterns (e.g., reentrancy, multi-contract call graphs) is provided, which is load-bearing for the generalizability of the reported performance gaps and single-action vs. multi-step asymmetry.

    Authors: We agree that a quantitative coverage analysis would further support the generalizability of our findings. Our task set was constructed to capture core EVM operations through atomic actions like token transfers, approvals, and DEX interactions, with composites building on these. The template sampling with parameter intervals aims to simulate variability in real transactions. However, we did not perform a direct comparison to mainnet traces. In the revision, we will add a dedicated limitations subsection discussing the representativeness of the tasks, citing common transaction types from literature or public data, and explicitly state that the benchmark prioritizes execution-grounded evaluation over exhaustive coverage. We believe the observed performance gaps and asymmetry hold value as they highlight challenges in multi-step reasoning even on these foundational tasks. We cannot provide a full coverage analysis at this stage without additional data collection. revision: partial

  2. Referee: [§5] §5 (Evaluation): The headline finding of persistent asymmetry is presented via split scores, yet the manuscript supplies no error analysis, per-task-type breakdown, or ablation on validator logic, leaving open the possibility that the observed gap is an artifact of the narrow task distribution rather than a robust model property.

    Authors: Thank you for highlighting this. We will revise §5 to include a per-task breakdown of success rates for atomic and composite tasks across all evaluated models. Additionally, we will incorporate an error analysis section categorizing failures (e.g., syntax errors, incorrect sequencing, parameter mismatches) based on execution logs. For the validator logic, the validators check post-execution state against expected outcomes using snapshot isolation, which we argue reduces artifacts; we will add a brief description of this in the revision. While the task distribution is focused, the consistent asymmetry across diverse models (from small to large) suggests it reflects a genuine capability gap rather than an artifact. We will expand the discussion to address potential influences of task design. revision: yes

standing simulated objections not resolved
  • Full quantitative coverage analysis of the task set against mainnet traces and adversarial patterns

Circularity Check

0 steps flagged

No circularity: benchmark construction and empirical evaluation are independent of fitted inputs or self-referential derivations

full rationale

The paper introduces EVM-QuestBench as a new benchmark with 107 tasks (62 atomic, 45 composite) sampled from template pools, parameters drawn from intervals, and validated via execution on a forked EVM chain with snapshot isolation. No equations, parameter fitting, or predictions are described that reduce by construction to the inputs. Model evaluations (20 models) are direct empirical measurements of performance gaps and asymmetry, not derived from prior self-citations or ansatzes. The modular architecture and step-efficiency decay are design choices, not self-definitional reductions. No uniqueness theorems or load-bearing self-citations appear in the provided text. The contribution is the benchmark release itself, which is self-contained and externally falsifiable via the released code and forked-chain runner.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The benchmark rests on the assumption that template-based task generation plus interval sampling produces representative and safe test cases; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5472 in / 1108 out tokens · 23974 ms · 2026-05-16T15:10:17.742519+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Intent2Tx: Benchmarking LLMs for Translating Natural Language Intents into Ethereum Transactions

    cs.AI 2026-04 unverdicted novelty 7.0

    Intent2Tx shows that LLMs often generate syntactically valid but functionally incorrect Ethereum transactions, especially on multi-step and out-of-distribution intents, despite gains from scaling and retrieval augmentation.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Program Synthesis with Large Language Models

    URL https://solana.com/news/solana-bench. Jacob Austin, Augustus Odena, Maxwell Nye, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021. URL https://arxiv.org/abs/2108.07732. Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. Execution-based evaluation for open-domain code generation. In Findi...

  2. [2]

    staking pool), the right function signature, and encode calldata that satisfies the ABI

    Correctly identify the target contract and function. The model must select the right protocol (e.g., PancakeSwap Router vs. staking pool), the right function signature, and encode calldata that satisfies the ABI

  3. [3]

    Swaps require slippage-tolerant minimum output values

    Handle chain-specific units. Token amounts must be converted from human-readable form to on-chain representation (e.g., 0.1 × 10^18 wei for 18-decimal tokens). Swaps require slippage-tolerant minimum output values

  4. [4]

    The model must identify and include these dependencies

    Satisfy protocol prerequisites. Many operations require a prior approval transaction (ERC-20 approve) before the main action can execute. The model must identify and include these dependencies

  5. [5]

    0x...", data:

    Propagate parameters across steps. In multi-step workflows, outputs from earlier steps (e.g., LP token amounts received from liquidity addition) feed into subsequent steps (e.g., staking). The model must track and propagate these values correctly. The benchmark does not evaluate contract deployment or Solidity code generation. It specifically targets the cli...

  6. [6]

    Missing exported executeSkill

  7. [7]

    Function signature mismatch

  8. [8]

    Return value is not a transaction-like object

  9. [9]

    Missing required to field

  10. [10]

    Serialization failure under ethers.js

  11. [11]

    No valid TypeScript code block when code is required

  12. [12]

    close but off by a few percent

    Control JSON is not parseable in composite control rounds. E Task Definition Schema: This section summarizes task fields that are most relevant for reproduction and error diagnosis. E.1 Atomic Task Fields: Atomic tasks specify one on-chain action and are validated by post-execution constraints. Field | Type | Description: id | string | Unique task id...

  13. [13]

    Queries the current BNB balance via provider.getBalance(account)

  14. [14]

    Computes amount = balance * 15n / 100n

  15. [15]

    Step 3: Execution. The runner signs and submits the transaction on the forked chain

    Returns a transaction request: {to: recipient, value: amount}. Step 3: Execution. The runner signs and submits the transaction on the forked chain. The transaction executes successfully (receipt status = 1). The runner records pre-execution and post-execution balances for both the sender and recipient. Step 4: Validation. The bnb_transfer_percentage validator ...

  16. [16]

    Transfer 15% of my BNB balance to 0xA1b2...C3d4

    Balance Change (30 pts): sender balance decreased by amount + gas; recipient balance increased by amount, both within 0.1% tolerance. ✓ Final score: 30 + 20 + 20 + 30 = 100 out of 100. Key design points. The expected transfer amount is computed dynamically from the fork state (not hardcoded), so the ground truth is always consistent with the execution environment. Th...
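The worked example in the last few entries (transfer 15% of the BNB balance, then check the balance change within 0.1% tolerance) can be sketched with bigint arithmetic. The helper names below are hypothetical, and the gas cost is passed in explicitly as a simplification; the source does not show how the actual bnb_transfer_percentage validator is written.

```typescript
// amount = balance * 15n / 100n, as in the worked example (integer division truncates).
function percentageOf(balanceWei: bigint, percent: bigint): bigint {
  return (balanceWei * percent) / 100n;
}

// True when |actual - expected| <= expected * 0.1%, using integer math only.
// The tolerance is expressed in basis points: 10 bps = 0.1%.
function withinTolerance(actual: bigint, expected: bigint, toleranceBps = 10n): boolean {
  const diff = actual > expected ? actual - expected : expected - actual;
  return diff * 10000n <= expected * toleranceBps;
}

// Hypothetical balance-change check: the recipient gains `amount`,
// the sender loses `amount` plus the gas cost.
function balanceChangeOk(
  senderBefore: bigint,
  senderAfter: bigint,
  recipientBefore: bigint,
  recipientAfter: bigint,
  amount: bigint,
  gasCost: bigint
): boolean {
  return (
    withinTolerance(recipientAfter - recipientBefore, amount) &&
    withinTolerance(senderBefore - senderAfter, amount + gasCost)
  );
}
```

Keeping everything in bigint avoids the precision loss that floating-point numbers would introduce at 18-decimal scale, which is the same reason the example computes the amount as `balance * 15n / 100n`.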