CodeClash: Benchmarking Goal-Oriented Software Engineering
Pith reviewed 2026-05-18 01:39 UTC · model grok-4.3
The pith
Language models lose every round to expert human programmers in goal-oriented code tournaments
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CodeClash runs language models through multi-round tournaments in which agents edit codebases and then face off in code arenas that award wins according to competitive objectives. Across 1680 tournaments (25,200 rounds), models display varied development styles but consistently struggle with strategic reasoning and with keeping their codebases from becoming progressively messy and redundant. The stark result is that top models lose every round against expert human programmers.
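As a quick sanity check on the reported scale, the sketch below works through the arithmetic. The counts of 8 models and 6 arenas are taken from the abstract. The rounds-per-tournament figure follows from the two totals only if every tournament has the same length. The second decomposition (two-player tournaments over every distinct model pair in every arena) is an assumption that happens to match the totals; the text here does not state how tournaments were allocated.

```python
from math import comb

TOURNAMENTS = 1680
ROUNDS = 25_200
MODELS = 8   # from the abstract
ARENAS = 6   # from the abstract

# Rounds per tournament, assuming every tournament has the same length.
rounds_per_tournament = ROUNDS // TOURNAMENTS
assert rounds_per_tournament * TOURNAMENTS == ROUNDS

# Assumption (not stated in the text): tournaments are two-player and cover
# every distinct pair of models in every arena.
pairings = comb(MODELS, 2) * ARENAS
tournaments_per_pairing = TOURNAMENTS // pairings
assert tournaments_per_pairing * pairings == TOURNAMENTS

print(rounds_per_tournament)    # 15
print(pairings)                 # 168
print(tournaments_per_pairing)  # 10
```

If the pairing assumption is wrong, only the last two numbers change; the 15 rounds per tournament depends solely on equal tournament length.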
What carries the argument
Multi-round tournaments alternating between self-directed code editing phases and head-to-head competitions in objective-based code arenas
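The alternation described above can be sketched as a small loop. This is a minimal illustration, not the benchmark's implementation: `Player`, `run_tournament`, `toy_edit`, and `toy_arena` are hypothetical names, the "higher score wins" rule is an assumption, and the toy editing and scoring functions stand in for real agents and arenas.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Player:
    name: str
    codebase: dict[str, str] = field(default_factory=dict)  # path -> contents
    wins: int = 0


# Hypothetical signatures: an editor mutates a player's codebase given the
# logs of earlier rounds; an arena scores two codebases against each other.
EditFn = Callable[[Player, list[str]], None]
ArenaFn = Callable[[dict[str, str], dict[str, str]], tuple[float, float]]


def run_tournament(a: Player, b: Player, edit: EditFn, arena: ArenaFn,
                   rounds: int) -> list[str]:
    logs: list[str] = []
    for r in range(1, rounds + 1):
        # Phase 1: each player edits its own codebase, seeing earlier logs.
        for player in (a, b):
            edit(player, logs)
        # Phase 2: the two codebases compete head-to-head in the arena.
        score_a, score_b = arena(a.codebase, b.codebase)
        winner: Optional[Player] = None
        if score_a > score_b:      # assumed rule: higher score wins the round
            winner = a
        elif score_b > score_a:
            winner = b
        if winner is not None:
            winner.wins += 1
        logs.append(f"round {r}: {a.name}={score_a} {b.name}={score_b}")
    return logs


def toy_edit(player: Player, logs: list[str]) -> None:
    # Stand-in for an editing phase: bump a tunable parameter each round.
    current = int(player.codebase.get("strength.txt", "0"))
    step = 2 if player.name == "alice" else 1
    player.codebase["strength.txt"] = str(current + step)


def toy_arena(code_a: dict[str, str], code_b: dict[str, str]) -> tuple[float, float]:
    # Stand-in for an arena: the score is just the stored parameter.
    return float(code_a["strength.txt"]), float(code_b["strength.txt"])


alice, bob = Player("alice"), Player("bob")
logs = run_tournament(alice, bob, toy_edit, toy_arena, rounds=15)
print(alice.wins, bob.wins, len(logs))
```

The point of the sketch is the state that persists between rounds: each codebase carries over, so early edits shape what later edits have to work with.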
Load-bearing premise
The chosen competitive objectives and arena rules serve as a valid proxy for real-world high-level software engineering goals that lack explicit step-by-step guidance.
What would settle it
Rerunning the same CodeClash tournaments would directly test the central claim: a single round won by any top model against the expert human programmers would refute the statement that top models lose every round.
Original abstract
Current benchmarks for coding evaluate language models (LMs) on concrete, well-specified tasks such as fixing specific bugs or writing targeted tests. However, human programmers do not spend all day incessantly addressing isolated tasks. Instead, real-world software development is grounded in the pursuit of high-level goals, like improving user retention or reducing costs. Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge. To address this, we introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives like score maximization, resource acquisition, or survival. Whether it's writing notes, scrutinizing documentation, analyzing competition logs, or creating test suites, models must decide for themselves how to improve their codebases both absolutely and against their opponents. We run 1680 tournaments (25,200 rounds total) to evaluate 8 LMs across 6 arenas. Our results reveal that while models exhibit diverse development styles, they share fundamental limitations in strategic reasoning. Models also struggle with long-term codebase maintenance, as repositories become progressively messy and redundant. These limitations are stark: top models lose every round against expert human programmers. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CodeClash, a benchmark in which language models compete in multi-round tournaments to iteratively develop codebases that achieve open-ended competitive objectives (e.g., score maximization, resource acquisition, survival) across six arenas. Agents edit code and then compete head-to-head in an arena evaluator; the study runs 1680 tournaments (25,200 rounds total) on eight LMs, reports diverse development styles together with limitations in strategic reasoning and long-term maintenance, and states that top models lose every round to expert human programmers.
Significance. If the central empirical claims hold under equivalent conditions, the work is a useful step beyond isolated-task coding benchmarks toward evaluating autonomous, goal-directed software engineering. The scale (1680 tournaments, 25,200 rounds) provides substantial empirical coverage, and open-sourcing the benchmark supports reproducibility and follow-on research.
major comments (1)
- [Abstract] The claim that 'top models lose every round against expert human programmers' is load-bearing for the paper's conclusions on intrinsic model limitations, yet the human participation protocol (code-editing interface, per-round time budgets, access to competition logs, and external tooling) is not specified in parallel with the model-agent description. Without this, the performance gap cannot be unambiguously attributed to strategic or maintenance shortcomings rather than setup asymmetry.
minor comments (1)
- [Abstract] Implementation details of arena scoring mechanics and how winners are determined from objectives are left unspecified, limiting assessment of whether the competitive proxy faithfully captures the intended high-level goals.
Simulated Authors' Rebuttal
We thank the referee for the constructive feedback. We address the major comment regarding the human participation protocol below and will incorporate the requested details into the revised manuscript.
Referee: [Abstract] The claim that 'top models lose every round against expert human programmers' is load-bearing for the paper's conclusions on intrinsic model limitations, yet the human participation protocol (code-editing interface, per-round time budgets, access to competition logs, and external tooling) is not specified in parallel with the model-agent description. Without this, the performance gap cannot be unambiguously attributed to strategic or maintenance shortcomings rather than setup asymmetry.
Authors: We agree that the human baseline protocol requires explicit, parallel specification to support the claim and to enable readers to evaluate whether the performance gap stems from model limitations or experimental asymmetry. In the revised manuscript we will add a dedicated subsection to the Experimental Setup (Section 4) that mirrors the model-agent description. This subsection will detail:
- the code-editing interface (a browser-based IDE with file tree navigation, syntax highlighting, and in-place editing, identical in functionality to the agent environment);
- per-round time budgets (20 minutes of active editing time plus 5 minutes for review and submission, calibrated to exceed typical model inference latency);
- access to competition logs (full round histories, opponent codebases, and arena evaluation outputs provided at the start of each editing phase);
- external tooling (standard language documentation, local test runners, and basic IDE features, with explicit prohibition of external AI assistants).
Humans received the same high-level objective statements as the agents and no additional strategic guidance. These additions will be placed immediately after the model-agent protocol description to facilitate direct comparison. We believe the revision will strengthen the attribution of observed limitations while preserving the empirical findings.
Revision: yes
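One way to make the requested parallel specification checkable is to record both protocols in the same structure and list the fields where they differ. The sketch below is illustrative only: the `Protocol` class and `asymmetries` helper are hypothetical, the human-side values restate the protocol promised in the reply above, and the agent-side values are placeholders invented for illustration, not figures from the paper.

```python
from dataclasses import dataclass, asdict
from typing import Any, Optional


@dataclass(frozen=True)
class Protocol:
    interface: str
    edit_minutes: Optional[int]      # None = no wall-clock budget recorded
    review_minutes: Optional[int]
    sees_round_logs: bool
    sees_opponent_code: bool
    external_ai_allowed: bool


def asymmetries(a: Protocol, b: Protocol) -> dict[str, tuple[Any, Any]]:
    """Return every field on which the two protocols differ."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}


# Human side: values as described in the authors' reply.
human = Protocol(
    interface="browser IDE",
    edit_minutes=20,
    review_minutes=5,
    sees_round_logs=True,
    sees_opponent_code=True,
    external_ai_allowed=False,
)

# Agent side: PLACEHOLDER values for illustration only.
agent = Protocol(
    interface="shell commands",
    edit_minutes=None,
    review_minutes=None,
    sees_round_logs=True,
    sees_opponent_code=True,
    external_ai_allowed=False,
)

diff = asymmetries(human, agent)
for name, (human_value, agent_value) in diff.items():
    print(f"{name}: human={human_value!r} agent={agent_value!r}")
```

Any field that shows up in the diff is a candidate explanation for the performance gap that would have to be ruled out before attributing the gap to the models themselves.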
Circularity Check
No circularity: empirical benchmark results are self-contained
The paper introduces CodeClash as a new benchmark and reports direct empirical outcomes from 1680 tournaments (25,200 rounds) evaluating 8 LMs against each other and expert humans across 6 arenas. No mathematical derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations exist. Central claims rest on observable competition results in code arenas rather than any reduction to inputs by construction, making the evaluation independent and self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The selected objectives (score maximization, resource acquisition, survival) and arena rules constitute representative tests of goal-oriented software engineering without explicit guidance.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "We introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena..."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "Models also struggle with long-term codebase maintenance, as repositories become progressively messy and redundant."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.