EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

Haiyang Shen; Kuan Li; Liang Chen; Wendong Xu; Xuanzhong Chen; Yun Ma

arxiv: 2605.24110 · v1 · pith:5PXD56BAnew · submitted 2026-05-22 · 💻 cs.AI


Pith reviewed 2026-06-30 16:19 UTC · model grok-4.3

classification 💻 cs.AI

keywords coding agents · multi-turn evaluation · iterative development · benchmark · LLM agents · persistent state · regression testing · software engineering


The pith

Single-round success rates for coding agents overestimate their ability to maintain codebases across changing requirements by 22-40 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EvoCode-Bench to evaluate coding agents in multi-turn iterative interactions where requirements evolve over rounds while preserving the workspace. It shows that single-round metrics (SR) significantly exceed multi-turn persistent execution scores (MT@4) for most of the 13 agents tested, with gaps of 22-40 points that also alter agent rankings. Even top agents reach only about 50% success on multi-turn metrics, and overall pass rates fall below half of initial performance by round 5. Failure modes differ by agent strength, with weaker ones failing early and stronger ones revealing issues in specification tracking and avoiding regressions. This highlights the need for benchmarks that capture ongoing development rather than isolated tasks.

Core claim

The central claim is that most coding agents perform substantially worse when required to iteratively update and maintain a codebase across multiple rounds with evolving requirements compared to single-round evaluations from a completed state, as measured by the gap between SR and MT@4 on EvoCode-Bench's 26 tasks.

What carries the argument

EvoCode-Bench, consisting of 26 stateful coding tasks evaluated over 5-15 rounds each using cumulative executable tests that verify both new and prior requirements.
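As a rough illustration of what "cumulative" means here, the sketch below builds the set of requirements that a given round's tests would need to check: everything introduced so far, minus anything a later instruction has superseded. The `Requirement` type, its fields, `active_requirements`, and the example requirement names are all hypothetical and chosen only for this example; the benchmark's actual test harness is not shown in this summary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Requirement:
    # Hypothetical record of one behavioral requirement in a task.
    name: str
    introduced_round: int
    superseded_round: Optional[int] = None  # round at which it stops applying


def active_requirements(reqs, round_idx):
    """Return names of requirements the tests for `round_idx` must check."""
    return [
        r.name
        for r in reqs
        if r.introduced_round <= round_idx
        and (r.superseded_round is None or r.superseded_round > round_idx)
    ]


# Hypothetical task: round 3 replaces the fixed delimiter with a configurable one.
reqs = [
    Requirement("parse_csv", introduced_round=1),
    Requirement("comma_delimiter_only", introduced_round=1, superseded_round=3),
    Requirement("custom_delimiter", introduced_round=3),
]

print(active_requirements(reqs, 2))
print(active_requirements(reqs, 3))
```

In this toy case the round-3 test set still checks `parse_csv`, which is where a regression would be caught, but no longer checks the superseded delimiter rule.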

If this is right

Agent rankings based on single-round performance do not reliably predict performance in persistent multi-turn settings.
Stronger agents survive longer rounds but still encounter specification-tracking and regression failures.
Aggregate success rates decline sharply after initial rounds, dropping below half of round-1 levels by round 5.
Weaker agents fail predominantly in early rounds while the benchmark exposes later-stage issues in capable agents.
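The decline in aggregate pass rate can be made concrete with a small sketch. It assumes that, under fail-stop, a trial that fails at round r is counted as failing every later round as well; that counting rule and the trial data below are assumptions for illustration, not values or definitions taken from the paper.

```python
def per_round_pass_rate(first_failures, n_rounds):
    """Pass rate per round under fail-stop.

    first_failures[i] is the 1-based round at which trial i first failed,
    or None if the trial passed every round.
    """
    n_trials = len(first_failures)
    rates = []
    for r in range(1, n_rounds + 1):
        # A trial passes round r only if it has not failed at or before r.
        survivors = sum(1 for f in first_failures if f is None or f > r)
        rates.append(survivors / n_trials)
    return rates


# Hypothetical first-failure rounds for eight trials on a 5-round task.
first_failures = [None, 2, 3, 1, 5, 4, None, 2]
rates = per_round_pass_rate(first_failures, n_rounds=5)

for r, rate in enumerate(rates, start=1):
    print(f"round {r}: {rate:.3f}")
print("round 5 below half of round 1:", rates[4] < 0.5 * rates[0])
```

Under this counting rule the per-round rate can only stay flat or fall, so the informative quantity is how fast it falls, not whether it does.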

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training or prompting strategies focused on single completions may not generalize to real iterative workflows.
Real-world deployment of coding agents may require additional mechanisms for state persistence and change tracking beyond current capabilities.
The released benchmark data and Harbor infrastructure enable development of agents optimized for multi-turn consistency.

Load-bearing premise

The 26 tasks and their cumulative executable tests are assumed to be representative of real iterative development and to expose general failure modes rather than artifacts of task selection or test design.

What would settle it

Evaluating the same agents on a substantially larger or differently constructed set of multi-turn tasks and finding no significant gap between SR and MT@4 scores, or no ranking changes, would falsify the overestimation claim.
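A replication along these lines would need two checks per agent set: the SR minus MT@4 gap, and whether the two metrics order the agents differently. The sketch below shows both on a toy table. Only the 78.9 SR and 44.0 MT@4 pair comes from the abstract; the agent names and every other score are hypothetical placeholders.

```python
# Scores in points. Only agent_a's pair is reported in the abstract;
# all other names and numbers are hypothetical.
scores = {
    "agent_a": {"sr": 78.9, "mt4": 44.0},
    "agent_b": {"sr": 75.0, "mt4": 50.0},
    "agent_c": {"sr": 72.0, "mt4": 47.0},
    "agent_d": {"sr": 60.0, "mt4": 30.0},
}


def ranking(table, metric):
    """Agent names ordered from best to worst on `metric`."""
    return sorted(table, key=lambda agent: table[agent][metric], reverse=True)


def gaps(table):
    """SR minus MT@4 for each agent, rounded to one decimal."""
    return {agent: round(s["sr"] - s["mt4"], 1) for agent, s in table.items()}


sr_rank = ranking(scores, "sr")
mt_rank = ranking(scores, "mt4")

print("SR ranking:  ", sr_rank)
print("MT@4 ranking:", mt_rank)
print("gaps:        ", gaps(scores))
print("rankings differ:", sr_rank != mt_rank)
```

If a larger task set produced gaps near zero and identical orderings, the overestimation claim would not hold for that set.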

Figures

Figures reproduced from arXiv: 2605.24110 by Haiyang Shen, Kuan Li, Liang Chen, Wendong Xu, Xuanzhong Chen, Yun Ma.

**Figure 1.** Overview of EVOCODE-BENCH. (a) Each round contains an instruction, reference solution, and cumulative tests checked by human review and oracle verification. (b) MT@4 keeps one Docker environment and agent session across rounds, with fail-stop termination. (c) SR fast-forwards to the reference state before each target round.

**Figure 2.** Dataset statistics for EVOCODE-BENCH beyond the taxonomy distribution in …

**Figure 3.** (a) SR vs. persistent MT@4 for each agent. (b) Per-round pass rate heatmap; rows are agents …

**Figure 4.** Per-round SR (single attempt) vs. MT@4 (best of four attempts). (a) SR remains stable while …

**Figure 5.** Failure diagnostics expanded from Section …

**Figure 6.** Auxiliary diagnostics for multi-round evaluation. (a) Average interaction turns are not mono…

**Figure 7.** Cross-attempt variance decomposition. (a) Per-model aptitude (any attempt passes) versus full …

**Figure 8.** Failure mode distribution by round index for each agent tier. Bar heights show the percentage …

**Figure 9.** Workspace state penalty. (a) Percentage of MT-failing rounds that are solvable under SR, by …

**Figure 10.** Output tokens per round for passing vs. failing trials (controlled for round index). Failing …

**Figure 11.** Change-type effect on pass rates, controlling for round position. (a) MT@4 pass rate by change …

Original abstract

Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basic question: can an agent keep its own codebase working as requirements change? We introduce EvoCode-Bench, a benchmark of 26 stateful coding tasks and 227 evaluated rounds. Each task preserves the agent's workspace for 5-15 rounds, states requirements through observable behavior, and uses cumulative executable tests to check new requirements and still-active prior ones. We evaluate 13 coding agents with two metrics: MT@4, a four-attempt fail-stop multi-round score, and SR, a single-round score from a reference-completed prior state. For most agents, SR exceeds MT@4 by 22-40 points. The gap also changes rankings: the highest-SR agent (78.9) ranks only third in persistent execution (44.0 MT@4). Even the strongest agents achieve only about 50% success on multi-turn metrics, and aggregate pass rate drops below half of round-1 performance by round 5. Failure analysis shows tier-dependent behavior: weaker agents fail early, while stronger agents survive long enough to expose specification-tracking and regression failures. We release the benchmark data and Harbor multi-turn infrastructure.
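The two metrics can be sketched for a single task as follows. This is a minimal reading of the abstract and the Figure 4 caption, under two assumptions: a fail-stop attempt is scored by the fraction of rounds passed before its first failure, and MT@4 takes the best of four attempts. The paper's exact aggregation may differ, and all outcomes below are hypothetical.

```python
def fail_stop_score(round_results, n_rounds):
    """Fraction of the task's rounds passed before the first failure.

    `round_results` lists pass/fail outcomes in order and may be shorter
    than `n_rounds`, because a fail-stop run ends at its first failure.
    """
    passed = 0
    for ok in round_results:
        if not ok:
            break
        passed += 1
    return passed / n_rounds


def mt_at_k(attempts, n_rounds):
    """Best fail-stop score over k attempts (assumed best-of-k aggregation)."""
    return max(fail_stop_score(a, n_rounds) for a in attempts)


def single_round_score(round_results):
    """Fraction of rounds passed when each starts from the reference state."""
    return sum(round_results) / len(round_results)


# Hypothetical outcomes for one 5-round task (True = cumulative tests passed).
attempts = [
    [True, True, False],
    [True, False],
    [True, True, True, False],
    [False],
]
sr_results = [True, True, False, True, True]

print(f"MT@4 = {mt_at_k(attempts, n_rounds=5):.2f}")
print(f"SR   = {single_round_score(sr_results):.2f}")
```

The difference in structure is the point: an SR failure at one round does not affect later rounds, while a persistent run stops at its first failure and earns nothing afterwards.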

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The benchmark shows single-round scores miss multi-turn maintenance failures and flip some rankings, but the 26 tasks are the load-bearing assumption.


The main thing to know is that single-round coding success does not predict how agents handle evolving requirements over multiple rounds. On this benchmark, SR scores run 22-40 points above MT@4 for most of the 13 agents tested, the top SR performer drops to third on the multi-turn metric, and even the best agents sit around 50% on persistent execution while aggregate pass rates halve by round 5.

What is new is the explicit focus on stateful workspaces that last 5-15 rounds, with cumulative executable tests that must keep passing prior requirements while new ones are added. The MT@4 metric (four attempts, fail-stop across rounds) and the comparison to SR from a reference prior state make the maintenance gap visible in a way single-turn benchmarks do not. The tier-dependent failure patterns—weak agents drop early, stronger ones reach specification-tracking and regression issues—are also useful observations.

The paper does a straightforward job releasing the tasks, data, and Harbor infrastructure so others can inspect or extend it.

The soft spot is exactly the one the stress-test flags: representativeness of the 26 tasks and 227 rounds. No clear protocol for task selection, domain spread, or test-construction rules is visible, so it remains possible the observed gaps partly reflect how the tasks and tests were built rather than general agent behavior. That assumption carries the headline claims.

This is for researchers who evaluate or train coding agents and want evidence that multi-turn maintenance is a distinct problem. It deserves peer review because the core framing is sound and the empirical comparison is worth referee scrutiny, even if the task set needs tighter justification.

Referee Report

1 major / 0 minor

Summary. The paper introduces EvoCode-Bench, a new benchmark of 26 stateful coding tasks spanning 227 rounds, designed to evaluate coding agents on multi-turn iterative development where requirements evolve while the workspace state is preserved and cumulative executable tests are applied. It reports results on 13 agents using two metrics, MT@4 (four-attempt fail-stop multi-round success) and SR (single-round success from a reference state). It claims that SR exceeds MT@4 by 22-40 points for most agents, that this gap changes agent rankings (e.g., the top SR agent ranks third on MT@4), that even the strongest agents reach only ~50% on multi-turn metrics, and that aggregate pass rates drop below half of round-1 performance by round 5. Failure analysis indicates tier-dependent patterns, with weaker agents failing early and stronger ones exposing specification-tracking and regression issues. The benchmark and Harbor infrastructure are released.

Significance. If the 26 tasks prove representative of real iterative development, the work provides a useful new evaluation framework that exposes limitations in current agents' ability to handle persistent state, evolving specifications, and regressions—limitations not captured by single-turn benchmarks. The release of data and infrastructure is a concrete strength that enables follow-on work.

major comments (1)

[Abstract and §3 (task/benchmark construction)] The central empirical claims (SR-MT@4 gaps of 22-40 points, ranking reversals, ~50% multi-turn success, pass-rate halving by round 5, and tier-dependent failure modes) rest on the assumption that the 26 tasks and their cumulative tests expose general failure modes rather than artifacts of task selection or test design. No section provides a task-selection protocol, domain coverage, complexity distribution, or construction details for the 227 rounds that would allow assessment of representativeness (e.g., whether tasks were chosen or tests written to highlight multi-turn fragility). This is load-bearing for generalizability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the importance of task construction details to support claims of generalizability. We address the major comment below and will revise the manuscript to incorporate additional transparency.


Referee: [Abstract and §3 (task/benchmark construction)] The central empirical claims (SR-MT@4 gaps of 22-40 points, ranking reversals, ~50% multi-turn success, pass-rate halving by round 5, and tier-dependent failure modes) rest on the assumption that the 26 tasks and their cumulative tests expose general failure modes rather than artifacts of task selection or test design. No section provides a task-selection protocol, domain coverage, complexity distribution, or construction details for the 227 rounds that would allow assessment of representativeness (e.g., whether tasks were chosen or tests written to highlight multi-turn fragility). This is load-bearing for generalizability.

Authors: We agree that the current manuscript lacks sufficient detail on task selection and construction, which limits readers' ability to evaluate representativeness. In the revised version, we will expand §3 with a new subsection that includes: (1) the task-selection protocol (criteria for choosing the 26 tasks and ensuring stateful iterative development scenarios), (2) domain coverage and complexity distribution (e.g., programming domains, lines of code, number of requirements per task), and (3) construction details for the 227 rounds, including how cumulative tests were authored to check both new and prior requirements while minimizing design artifacts. These additions will directly address concerns about whether observed gaps and failure modes reflect general agent limitations or benchmark-specific choices. The released benchmark data will also be accompanied by this expanded documentation.

revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on newly introduced benchmark


The paper defines 26 new stateful tasks, 227 rounds, and two metrics (MT@4, SR) then reports raw success rates, ranking shifts, and failure patterns from running 13 agents. These are direct observations against the introduced corpus; no equations, fitted parameters, self-citations, or prior results are invoked to derive the headline numbers. The central claims reduce only to the benchmark execution itself, which is externally falsifiable by re-running the released tasks and agents.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or fitted parameters appear in the abstract; the contribution is empirical benchmark construction and evaluation.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LemonHarness Technical Report
cs.AI · 2026-06 · unverdicted · novelty 5.0

LemonHarness constrains LLM agent state changes to a defined workspace, supplies callable rule knowledge, and adds time awareness, yielding 84.49% and 86.52% accuracy on Terminal-Bench 2.0 with two GPT-5 backbones.

Reference graph

Works this paper leans on

9 extracted references · 3 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

URL https://openreview.net/forum?id=VTF8yNQM66. Wai-Chung Kwan, Xingshan Zeng, Yuxin Jiang, Yufei Wang, Liangyou Li, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. MT-eval: A multi-turn capabilities evaluation benchmark for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 20153–20177,...

doi:10.18653/v1/2024.emnlp-main.1124 · 2024
[2]

URL https://arxiv.org/abs/2601.11868. OpenAI. Codex. Product documentation, 2026. URL https://openai.com/codex. Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan ...

doi:10.18653/v1/2024.acl-long.850 · 2026
[3]

OpenReview.net, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/hash/63fef0802863f47775c3563e18cbba17-Abstract-Conference.html. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang. LMSYS-chat-1m: A large-scale real-w...

[4]

Describe behavior, not implementation. Instructions specify what the system should do in terms of observable behavior, not how to achieve it. No algorithm names, data structure choices, or file organization prescriptions appear in instructions. This lets different agents reach equivalent behavioral goals through different implementation paths.
[5]

Test behavioral requirements, not implementation details. Verification scripts check the system's external behavior through actual execution: running commands, sending inputs, and checking outputs. They never inspect source code structure, function names, or file organization. This makes the same test suite valid across arbitrary implementation paths.
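A behavioral check of this kind can be sketched with the standard library alone. The example below runs a program as a subprocess, feeds it input, and asserts only on its exit code and output. The `run_cli` helper and the one-line stand-in program are hypothetical; they are not part of the benchmark's verification scripts.

```python
import subprocess
import sys


def run_cli(args, stdin_text=""):
    """Run a command and return (exit_code, stdout) without reading its source."""
    proc = subprocess.run(
        args,
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.returncode, proc.stdout


# Hypothetical stand-in for an agent-written program: sums integers from stdin.
program = "import sys; print(sum(int(x) for x in sys.stdin.read().split()))"

exit_code, stdout = run_cli([sys.executable, "-c", program], stdin_text="1 2 3\n")

# The check depends only on observable behavior, so any implementation
# that prints the correct sum would pass it.
assert exit_code == 0
assert stdout.strip() == "6"
print("behavioral check passed")
```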
[6]

Round-independent testing. Round i's tests cannot assume the agent used the same code structure as the reference solution in rounds 1 through i − 1. They can only assume the agent passed the behavioral tests of all prior rounds. This handles divergent implementation paths in multi-turn evaluation. These principles are enforced at every stage of the pipeline:...
[7]

Answer correctness verification. Independent reviewers verify that the reference solution for each round produces the behavior specified by the instructions. They check for alternative valid interpretations that the tests might not distinguish and for edge cases that the cumulative tests might miss.
[8]

Test-specification alignment audit. Reviewers verify that every test assertion traces back to a current specification, either newly introduced in the current round or persisting from a previous round. They also verify that no test checks stale requirements that have been superseded by later rounds, which would cause otherwise correct implementations to fail.
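The audit is described as a human review step, but its bookkeeping can be sketched mechanically. The example below assumes each test is tagged with the requirement it checks, then flags tests tied to superseded requirements and tests that trace to no known requirement. The `audit_tests` function, the tagging scheme, and all names in the data are hypothetical.

```python
def audit_tests(test_to_requirement, active, superseded):
    """Return (stale, untraced) test names for one round.

    stale: tests that check a requirement superseded by a later round.
    untraced: tests whose requirement is neither active nor superseded.
    """
    stale = sorted(
        t for t, r in test_to_requirement.items() if r in superseded
    )
    untraced = sorted(
        t
        for t, r in test_to_requirement.items()
        if r not in active and r not in superseded
    )
    return stale, untraced


# Hypothetical round-3 test suite and requirement sets.
test_to_requirement = {
    "test_parses_rows": "parse_csv",
    "test_rejects_semicolon": "comma_delimiter_only",
    "test_accepts_semicolon": "custom_delimiter",
    "test_colored_output": "color_flag",
}
active = {"parse_csv", "custom_delimiter"}
superseded = {"comma_delimiter_only"}

stale, untraced = audit_tests(test_to_requirement, active, superseded)
print("stale tests:   ", stale)
print("untraced tests:", untraced)
```

A stale test here is the failure case the audit guards against: it would reject an implementation that correctly follows the round-3 instruction.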
[9]

Shortcut analysis. Reviewers attempt to identify implementation strategies that would pass all tests without satisfying the task requirements. This includes checking for overly specific test inputs that allow hardcoding, test orderings that leak information, and behavioral checks that are too loose to distinguish correct from incorrect implementations. Tas...
