EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation
Pith reviewed 2026-05-16 15:10 UTC · model grok-4.3
The pith
EVM-QuestBench shows language models handle single EVM actions better than full multi-step transaction workflows.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains. The benchmark employs dynamic evaluation: instructions are sampled from template pools, numeric parameters are drawn from predefined intervals, and validators verify outcomes against these instantiated values. EVM-QuestBench contains 107 tasks (62 atomic, 45 composite). Its modular architecture enables rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation; composite tasks apply step-efficiency decay. We evaluate 20 models and find large performance gaps, with split scores revealing persistent asymmetry between single-action precision and multi-step workflow completion.
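As a rough illustration of the dynamic evaluation described in this claim, the sketch below instantiates a task from a template pool and a numeric interval, then validates an outcome against the instantiated value. The names TaskTemplate, instantiate, and validate, the seeded generator, and the example template strings are all assumptions made for illustration; none of them are taken from the benchmark's code.

```typescript
// Hypothetical sketch: names and template text are illustrative, not the benchmark's API.
interface TaskTemplate {
  instructions: string[]; // pool of phrasings with an {amount} slot
  amountRange: { min: number; max: number; decimals: number };
}

interface TaskInstance {
  instruction: string;
  amount: number; // the instantiated ground-truth value
}

// Small seeded PRNG (mulberry32 style) so that a run can be reproduced from its seed.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function instantiate(template: TaskTemplate, rand: () => number): TaskInstance {
  const { min, max, decimals } = template.amountRange;
  const scale = Math.pow(10, decimals);
  // Draw a value from the interval and round it to the allowed precision.
  const amount = Math.round((min + rand() * (max - min)) * scale) / scale;
  const index = Math.floor(rand() * template.instructions.length);
  const phrasing = template.instructions[index];
  return { instruction: phrasing.replace("{amount}", String(amount)), amount };
}

// The validator compares the observed outcome with the sampled value, not a fixed constant.
function validate(instance: TaskInstance, observedAmount: number, tol = 1e-9): boolean {
  return Math.abs(observedAmount - instance.amount) <= tol;
}

const template: TaskTemplate = {
  instructions: ["Send {amount} BNB to Alice", "Transfer {amount} BNB to Alice"],
  amountRange: { min: 0.01, max: 0.5, decimals: 3 },
};
const task = instantiate(template, seededRandom(42));
```

Seeding the generator is a design choice of this sketch: it keeps each sampled instance reproducible while still varying the values across seeds.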
What carries the argument
EVM-QuestBench benchmark with dynamic sampling from task templates, parameter randomization, execution on a forked EVM chain, and outcome validation to measure both correctness and safety of generated transaction code.
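Snapshot isolation on a forked chain could look roughly like the following, assuming the forked node exposes the evm_snapshot and evm_revert JSON-RPC methods that development nodes such as Hardhat Network and Anvil provide. The RpcSend type, the withSnapshot helper, and the in-memory mock node are assumptions for illustration; the benchmark's actual runner may be organized differently.

```typescript
// Hypothetical sketch of snapshot isolation around one task run.
type RpcSend = (method: string, params: unknown[]) => Promise<unknown>;

async function withSnapshot<T>(send: RpcSend, run: () => Promise<T>): Promise<T> {
  const snapshotId = await send("evm_snapshot", []);
  try {
    return await run();
  } finally {
    // Roll the fork back so that the next task starts from the same state.
    const reverted = await send("evm_revert", [snapshotId]);
    if (reverted !== true) {
      throw new Error("evm_revert failed for snapshot " + String(snapshotId));
    }
  }
}

// In-memory stand-in for a forked node, used only to exercise the helper.
function makeMockNode(initialBalance: number) {
  let balance = initialBalance;
  const snapshots = new Map<string, number>();
  let nextId = 1;
  const send: RpcSend = async (method, params) => {
    if (method === "evm_snapshot") {
      const id = "0x" + (nextId++).toString(16);
      snapshots.set(id, balance);
      return id;
    }
    if (method === "evm_revert") {
      const key = String(params[0]);
      const saved = snapshots.get(key);
      if (saved === undefined) return false;
      balance = saved;
      snapshots.delete(key);
      return true;
    }
    throw new Error("unsupported method: " + method);
  };
  return {
    send,
    spend: (n: number) => { balance -= n; },
    getBalance: () => balance,
  };
}
```

Putting the revert in a finally block means a script that throws midway still leaves the fork clean for the next task.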
If this is right
- Models show better results on atomic tasks than on composite ones, pointing to a specific weakness in handling sequential dependencies.
- Execution-based checking on a live forked chain catches errors that static code analysis would miss.
- The template-driven, modular design makes it straightforward to add new task types without rebuilding the entire system.
- Step-efficiency decay in scoring penalizes solutions that use unnecessary intermediate steps.
- Large performance gaps across 20 models indicate that current training regimes do not yet produce reliable on-chain automation.
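The text here does not give the formula for step-efficiency decay, so the following is only one plausible shape for it. The geometric form, the 0.9 factor, and the names stepEfficiencyScore, optimalSteps, and decayPerExtraStep are assumptions, not the paper's definition.

```typescript
// Hypothetical decay: the paper's actual function is not specified in this text.
function stepEfficiencyScore(
  baseScore: number,       // score earned from outcome validation, e.g. 0..100
  stepsUsed: number,       // steps the model actually took
  optimalSteps: number,    // reference number of steps for the task
  decayPerExtraStep = 0.9  // assumed multiplicative penalty per surplus step
): number {
  if (stepsUsed < 1 || optimalSteps < 1) {
    throw new RangeError("step counts must be >= 1");
  }
  // Only steps beyond the reference count are penalized.
  const extraSteps = Math.max(0, stepsUsed - optimalSteps);
  return baseScore * Math.pow(decayPerExtraStep, extraSteps);
}
```

Under these assumed values, a correct solution that takes two steps more than the reference keeps about 81% of its base score, while a solution at or below the reference count keeps all of it.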
Where Pith is reading between the lines
- Better scores on this benchmark could translate into fewer irreversible losses when users rely on language models to construct DeFi transactions.
- The single-action versus multi-step asymmetry suggests training data should emphasize long-range dependencies across transaction steps.
- The same dynamic evaluation pattern could be applied to other blockchain environments to test cross-platform reliability.
- Deploying such models in production would still require additional runtime safeguards beyond benchmark performance.
Load-bearing premise
The 107 tasks sampled from templates and validated on a forked chain are assumed to represent the diversity and safety requirements of real-world EVM transaction scenarios.
What would settle it
If models that score highly on EVM-QuestBench still produce scripts that fail or lose funds when executed on mainnet with real assets, the benchmark's claim to measure safe transaction generation would be refuted.
Original abstract
Large language models are increasingly applied to various development scenarios. However, in on-chain transaction scenarios, even a minor error can cause irreversible loss for users. Existing evaluations often overlook execution accuracy and safety. We introduce EVM-QuestBench, an execution-grounded benchmark for natural-language transaction-script generation on EVM-compatible chains. The benchmark employs dynamic evaluation: instructions are sampled from template pools, numeric parameters are drawn from predefined intervals, and validators verify outcomes against these instantiated values. EVM-QuestBench contains 107 tasks (62 atomic, 45 composite). Its modular architecture enables rapid task development. The runner executes scripts on a forked EVM chain with snapshot isolation; composite tasks apply step-efficiency decay. We evaluate 20 models and find large performance gaps, with split scores revealing persistent asymmetry between single-action precision and multi-step workflow completion. Code: https://anonymous.4open.science/r/bsc_quest_bench-A9CF/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EVM-QuestBench, an execution-grounded benchmark for natural-language generation of EVM transaction scripts. It consists of 107 tasks (62 atomic, 45 composite) constructed via template sampling with numeric parameters drawn from intervals, validated through execution on a forked EVM chain using snapshot isolation and outcome validators. The modular design supports rapid task addition, and composite tasks incorporate step-efficiency decay. Evaluation of 20 models reveals large performance gaps and a persistent asymmetry between single-action precision and multi-step workflow completion.
Significance. If the task set proves representative, the benchmark supplies a reproducible, execution-verified framework that exposes concrete limitations in current LLMs for safety-critical blockchain scripting. The dynamic evaluation protocol and public code release are concrete strengths that could accelerate progress on reliable on-chain code generation.
major comments (2)
- [§3] §3 (Benchmark Construction): The assertion that the 107 template-derived tasks (62 atomic + 45 composite) constitute a sufficient proxy for real-world EVM transaction diversity rests on an unverified assumption; no quantitative coverage analysis against mainnet transaction traces, state-dependency statistics, or adversarial patterns (e.g., reentrancy, multi-contract call graphs) is provided, which is load-bearing for the generalizability of the reported performance gaps and single-action vs. multi-step asymmetry.
- [§5] §5 (Evaluation): The headline finding of persistent asymmetry is presented via split scores, yet the manuscript supplies no error analysis, per-task-type breakdown, or ablation on validator logic, leaving open the possibility that the observed gap is an artifact of the narrow task distribution rather than a robust model property.
minor comments (2)
- [§4] The description of the snapshot-isolated runner and step-efficiency decay mechanism would benefit from a small pseudocode listing or explicit formula for the decay function to improve reproducibility.
- [§5] Table or figure captions for the 20-model results should explicitly state the exact metric definitions (e.g., success rate, efficiency-adjusted score) rather than relying solely on prose.
Simulated Author's Rebuttal
We thank the referee for the insightful comments. We respond to each major comment below and indicate planned revisions to the manuscript.
- Referee: [§3] §3 (Benchmark Construction): The assertion that the 107 template-derived tasks (62 atomic + 45 composite) constitute a sufficient proxy for real-world EVM transaction diversity rests on an unverified assumption; no quantitative coverage analysis against mainnet transaction traces, state-dependency statistics, or adversarial patterns (e.g., reentrancy, multi-contract call graphs) is provided, which is load-bearing for the generalizability of the reported performance gaps and single-action vs. multi-step asymmetry.
Authors: We agree that a quantitative coverage analysis would further support the generalizability of our findings. Our task set was constructed to capture core EVM operations through atomic actions like token transfers, approvals, and DEX interactions, with composites building on these. The template sampling with parameter intervals aims to simulate variability in real transactions. However, we did not perform a direct comparison to mainnet traces. In the revision, we will add a dedicated limitations subsection discussing the representativeness of the tasks, citing common transaction types from literature or public data, and explicitly state that the benchmark prioritizes execution-grounded evaluation over exhaustive coverage. We believe the observed performance gaps and asymmetry hold value as they highlight challenges in multi-step reasoning even on these foundational tasks. We cannot provide a full coverage analysis at this stage without additional data collection. revision: partial
- Referee: [§5] §5 (Evaluation): The headline finding of persistent asymmetry is presented via split scores, yet the manuscript supplies no error analysis, per-task-type breakdown, or ablation on validator logic, leaving open the possibility that the observed gap is an artifact of the narrow task distribution rather than a robust model property.
Authors: Thank you for highlighting this. We will revise §5 to include a per-task breakdown of success rates for atomic and composite tasks across all evaluated models. Additionally, we will incorporate an error analysis section categorizing failures (e.g., syntax errors, incorrect sequencing, parameter mismatches) based on execution logs. For the validator logic, the validators check post-execution state against expected outcomes using snapshot isolation, which we argue reduces artifacts; we will add a brief description of this in the revision. While the task distribution is focused, the consistent asymmetry across diverse models (from small to large) suggests it reflects a genuine capability gap rather than an artifact. We will expand the discussion to address potential influences of task design. revision: yes
- Full quantitative coverage analysis of the task set against mainnet traces and adversarial patterns
Circularity Check
No circularity: benchmark construction and empirical evaluation are independent of fitted inputs or self-referential derivations
The paper introduces EVM-QuestBench as a new benchmark with 107 tasks (62 atomic, 45 composite) sampled from template pools, parameters drawn from intervals, and validated via execution on a forked EVM chain with snapshot isolation. No equations, parameter fitting, or predictions are described that reduce by construction to the inputs. Model evaluations (20 models) are direct empirical measurements of performance gaps and asymmetry, not derived from prior self-citations or ansatzes. The modular architecture and step-efficiency decay are design choices, not self-definitional reductions. No uniqueness theorems or load-bearing self-citations appear in the provided text. The contribution is the benchmark release itself, which is self-contained and externally falsifiable via the released code and forked-chain runner.
Forward citations
Cited by 1 Pith paper
- Intent2Tx: Benchmarking LLMs for Translating Natural Language Intents into Ethereum Transactions
  Intent2Tx shows that LLMs often generate syntactically valid but functionally incorrect Ethereum transactions, especially on multi-step and out-of-distribution intents, despite gains from scaling and retrieval augmentation.
Reference graph
Works this paper leans on
- [1] Program Synthesis with Large Language Models
  URL https://solana.com/news/solana-bench. Jacob Austin, Augustus Odena, Maxwell Nye, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021. URL https://arxiv.org/abs/2108.07732. Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. Execution-based evaluation for open-domain code generation. In Findi...
- [2] Correctly identify the target contract and function. The model must select the right protocol (e.g., PancakeSwap Router vs. staking pool), the right function signature, and encode calldata that satisfies the ABI.
- [3] Handle chain-specific units. Token amounts must be converted from human-readable form to on-chain representation (e.g., 0.1 × 10^18 wei for 18-decimal tokens). Swaps require slippage-tolerant minimum output values.
- [4] Satisfy protocol prerequisites. Many operations require a prior approval transaction (ERC-20 approve) before the main action can execute. The model must identify and include these dependencies.
- [5] Propagate parameters across steps. In multi-step workflows, outputs from earlier steps (e.g., LP token amounts received from liquidity addition) feed into subsequent steps (e.g., staking). The model must track and propagate these values correctly.
  The benchmark does not evaluate contract deployment or Solidity code generation. It specifically targets the cli...
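The unit handling named in the list above can be sketched with plain BigInt arithmetic. The helpers toBaseUnits and minAmountOut and the basis-point convention are assumptions for illustration; in a real script one would more likely reach for the parseUnits helper that ethers.js ships, and the benchmark's own conventions for slippage are not stated in this text.

```typescript
// Convert a decimal string such as "0.1" into base units (wei for 18 decimals).
// String manipulation avoids the rounding errors of floating-point multiplication.
function toBaseUnits(amount: string, decimals: number): bigint {
  if (!/^\d+(\.\d+)?$/.test(amount)) {
    throw new Error("invalid amount: " + amount);
  }
  const parts = amount.split(".");
  const whole = parts[0];
  let fraction = parts.length > 1 ? parts[1] : "";
  if (fraction.length > decimals) {
    throw new Error("too many decimal places");
  }
  // Right-pad the fractional part with zeros up to the token's decimals.
  while (fraction.length < decimals) fraction += "0";
  return BigInt(whole + fraction);
}

// Minimum acceptable swap output for a slippage tolerance given in basis points
// (assumed convention: 50 bps = 0.5%).
function minAmountOut(quotedOut: bigint, slippageBps: bigint): bigint {
  if (slippageBps < 0n || slippageBps > 10000n) {
    throw new RangeError("slippage out of range");
  }
  return (quotedOut * (10000n - slippageBps)) / 10000n;
}

const pointOneEther = toBaseUnits("0.1", 18); // 0.1 × 10^18 wei
const floorOut = minAmountOut(1000000n, 50n);  // quoted output minus 0.5%
```

Passing 0.1 as a JavaScript number and multiplying by 1e18 is the kind of shortcut this avoids, since the result then depends on floating-point rounding.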
- [6] Missing exported executeSkill
- [7] Function signature mismatch
- [8] Return value is not a transaction-like object
- [9] Missing required to field
- [10] Serialization failure under ethers.js
- [11] No valid TypeScript code block when code is required
- [12] close but off by a few percent
  Control JSON is not parseable in composite control rounds
E Task Definition Schema
This section summarizes task fields that are most relevant for reproduction and error diagnosis.
E.1 Atomic Task Fields
Atomic tasks specify one on-chain action and are validated by post-execution constraints.
Field | Type | Description
id | string | Unique task id...
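Several of the failure categories listed above (missing executeSkill export, non-transaction return value, missing to field, serialization failure, unparseable control JSON) are structural and could be caught before anything is sent to the chain. The sketch below shows one way to do that; CheckResult and the three check functions are hypothetical names, and the JSON round trip is only a stand-in for whatever serialization ethers.js performs in the real runner.

```typescript
// Hypothetical pre-execution checks mirroring the failure categories listed above.
type CheckResult = { ok: true } | { ok: false; reason: string };

function checkSkillModule(mod: Record<string, unknown>): CheckResult {
  if (typeof mod.executeSkill !== "function") {
    return { ok: false, reason: "missing exported executeSkill" };
  }
  return { ok: true };
}

function checkTransactionLike(value: unknown): CheckResult {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return { ok: false, reason: "return value is not a transaction-like object" };
  }
  const tx = value as Record<string, unknown>;
  if (typeof tx.to !== "string" || !/^0x[0-9a-fA-F]{40}$/.test(tx.to)) {
    return { ok: false, reason: "missing required to field" };
  }
  try {
    // Plain JSON cannot encode bigint, so convert it to a string first.
    JSON.stringify(tx, (_key, v) => (typeof v === "bigint" ? v.toString() : v));
  } catch {
    return { ok: false, reason: "serialization failure" };
  }
  return { ok: true };
}

function checkControlJson(raw: string): CheckResult {
  try {
    JSON.parse(raw);
    return { ok: true };
  } catch {
    return { ok: false, reason: "control JSON is not parseable" };
  }
}
```

Returning a reason string rather than a bare boolean keeps each failure attributable to one category, which is what an error analysis over execution logs would need.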
- [13] Queries the current BNB balance via provider.getBalance(account).
- [14] Computes amount = balance * 15n / 100n.
- [15] Returns a transaction request: {to: recipient, value: amount}.
Step 3: Execution. The runner signs and submits the transaction on the forked chain. The transaction executes successfully (receipt status = 1). The runner records pre-execution and post-execution balances for both the sender and recipient.
Step 4: Validation. The bnb_transfer_percentage validator ...
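The three script steps in this worked example (query the balance, compute 15%, return a request) can be sketched as follows. The BalanceProvider and TransactionRequest interfaces and the function name buildPercentageTransfer are hypothetical; the real executeSkill signature is not shown in this text, so this is not a drop-in implementation of it.

```typescript
// Hypothetical shapes: the benchmark's real executeSkill signature is not shown in this text.
interface BalanceProvider {
  getBalance(address: string): Promise<bigint>;
}

interface TransactionRequest {
  to: string;
  value: bigint;
}

async function buildPercentageTransfer(
  provider: BalanceProvider,
  account: string,
  recipient: string,
  percent: bigint
): Promise<TransactionRequest> {
  if (percent <= 0n || percent > 100n) {
    throw new RangeError("percent must be in (0, 100]");
  }
  // Read the balance from the chain at call time rather than hardcoding it.
  const balance = await provider.getBalance(account);
  // Multiply before dividing: percent / 100n on its own would truncate to zero.
  const amount = (balance * percent) / 100n;
  return { to: recipient, value: amount };
}
```

For a balance of 2 BNB (2 × 10^18 wei) and percent set to 15n, this yields a value of 3 × 10^17 wei.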
- [16] Transfer 15% of my BNB balance to 0xA1b2...C3d4
  Balance Change (30 pts): sender balance decreased by amount + gas; recipient balance increased by amount, both within 0.1% tolerance. ✓
  Final score: 30 + 20 + 20 + 30 = 100 out of 100.
  Key design points. The expected transfer amount is computed dynamically from the fork state (not hardcoded), so the ground truth is always consistent with the execution environment. Th...
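The Balance Change check described above, with its 0.1% tolerance, might be implemented along these lines. The names withinTolerance, BalanceSnapshot, and checkBalanceChange are assumptions, as is expressing the tolerance in basis points; only the 0.1% figure and the two conditions (sender down by amount + gas, recipient up by amount) come from the text.

```typescript
// True when |actual - expected| <= expected * toleranceBps / 10000 (10 bps = 0.1%).
// Integer cross-multiplication avoids any floating-point division on wei amounts.
function withinTolerance(actual: bigint, expected: bigint, toleranceBps: bigint = 10n): boolean {
  const diff = actual > expected ? actual - expected : expected - actual;
  return diff * 10000n <= expected * toleranceBps;
}

interface BalanceSnapshot {
  sender: bigint;
  recipient: bigint;
}

// Hypothetical validator component for the two balance conditions.
function checkBalanceChange(
  before: BalanceSnapshot,
  after: BalanceSnapshot,
  expectedAmount: bigint,
  gasCost: bigint
): boolean {
  const senderDecrease = before.sender - after.sender;
  const recipientIncrease = after.recipient - before.recipient;
  return (
    withinTolerance(senderDecrease, expectedAmount + gasCost) &&
    withinTolerance(recipientIncrease, expectedAmount)
  );
}
```

Here expectedAmount would itself be derived from the pre-execution fork state, in line with the key design point that the ground truth is computed dynamically.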