CodeClash: Benchmarking Goal-Oriented Software Engineering

Aryan Siddiqui; Carlos E. Jimenez; Diyi Yang; John Yang; Joyce Yang; Kilian Lieret; Ludwig Schmidt; Muhtasham Oblokulov; Ofir Press

arxiv: 2511.00839 · v2 · pith:RP3SGZIJnew · submitted 2025-11-02 · 💻 cs.SE · cs.AI

Pith reviewed 2026-05-18 01:39 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords CodeClash, language models, software engineering benchmark, goal-oriented coding, strategic reasoning, codebase maintenance, multi-round tournaments, competitive arenas

The pith

Language models lose every round to expert human programmers in goal-oriented code tournaments

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CodeClash to test whether language models can iteratively develop codebases toward open-ended competitive objectives without explicit step-by-step instructions. Models edit their code in rounds and then compete head-to-head in arenas scored on goals such as score maximization, resource acquisition, or survival. Evaluation across 1680 tournaments with eight models shows diverse development approaches yet shared weaknesses in strategic reasoning and long-term codebase upkeep, as repositories grow messy and redundant. Top models are defeated in every round by expert humans. This setup is meant to reflect real software engineering more closely than benchmarks limited to isolated tasks.

Core claim

CodeClash runs language models through multi-round tournaments in which agents edit their codebases and then face off in code arenas that award wins according to competitive objectives. Across 1680 tournaments (25,200 rounds in total), models display varied development styles but consistently struggle with strategic reasoning and with keeping their code from becoming progressively messy and redundant. The starkest result is that top models lose every round against expert human programmers.

What carries the argument

Multi-round tournaments alternating between self-directed code editing phases and head-to-head competitions in objective-based code arenas
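
The alternation between an editing phase and a competition phase can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the CodeClash implementation: `run_tournament`, the `edit` and `arena` callables, and the dictionary-based codebase are hypothetical names introduced here, and the default of 15 rounds is inferred from 25,200 rounds spread over 1680 tournaments.

```python
from typing import Callable, Dict, List

# Hypothetical representation: a codebase is a mapping of filename -> file content.
Codebase = Dict[str, str]


def run_tournament(
    players: List[str],
    edit: Callable[[str, Codebase, List[str]], Codebase],
    arena: Callable[[Dict[str, Codebase]], str],
    rounds: int = 15,
) -> List[str]:
    """Alternate an edit phase and a competition phase; return each round's winner."""
    codebases: Dict[str, Codebase] = {p: {} for p in players}
    history: List[str] = []
    for _ in range(rounds):
        # Edit phase: every player revises its own codebase, seeing past winners.
        for p in players:
            codebases[p] = edit(p, codebases[p], history)
        # Competition phase: the arena compares the codebases head-to-head.
        history.append(arena(codebases))
    return history


def toy_edit(player: str, codebase: Codebase, history: List[str]) -> Codebase:
    """Toy stand-in for an agent: add a notes file, and for player 'a' also a bot file."""
    new = dict(codebase)
    new[f"notes_{len(history)}.md"] = "placeholder"
    if player == "a":
        new[f"bot_{len(history)}.py"] = "placeholder"
    return new


def toy_arena(codebases: Dict[str, Codebase]) -> str:
    """Toy stand-in for an arena: the player with the most files wins."""
    return max(sorted(codebases), key=lambda p: len(codebases[p]))


winners = run_tournament(["a", "b"], toy_edit, toy_arena)
```

The toy arena rewards file count only to keep the example deterministic; the real arenas score objectives such as survival or resource acquisition.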

Load-bearing premise

The chosen competitive objectives and arena rules serve as a valid proxy for real-world high-level software engineering goals that lack explicit step-by-step guidance.

What would settle it

Re-running the same CodeClash tournaments and observing even one top model win a round against the expert human programmers would falsify the claim that models lose every round.

Figures

Figures reproduced from arXiv: 2511.00839 by Aryan Siddiqui, Carlos E. Jimenez, Diyi Yang, John Yang, Joyce Yang, Kilian Lieret, Ludwig Schmidt, Muhtasham Oblokulov, Ofir Press.

**Figure 3.** Win rates across rounds, illustrating how different models gain (Claude Sonnet 4.5) or lose momentum (GPT-5) over the course of the tournament. Spilled paper text (4.1 Ablations): On RobotRumble, models trail substantially behind expert human programmers. From RobotRumble’s leaderboard, we identified the top open-source submission as of October 31, 2025, a bot called gigachad authored by entropicdrifter. We run 10 tourname…

**Figure 4.** Probability of winning the next round after losing several rounds in a row. Even the highest ranking models struggle to recover after losing one or more consecutive rounds in a tournament. Numbers in parentheses indicate the overall average win rate. [Plot residue from an adjacent figure: x-axis Round (1 to 15), y-axis Mean Code Similarity (0.2 to 0.6); series Claude Sonnet 4, Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-5, GPT-5 Mini, Grok Code Fast, o3, Qwen3 Coder.]
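
The quantity plotted in Figure 4 can be estimated from per-round win/loss sequences. The sketch below is one plausible reading, assuming "after losing k rounds in a row" means the round was preceded by at least k consecutive losses; the function name and the input layout are hypothetical, not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List


def win_prob_after_losing_streak(
    tournaments: List[List[bool]], max_streak: int = 3
) -> Dict[int, float]:
    """For each k, estimate P(win this round | the previous k rounds were all losses)."""
    wins: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for results in tournaments:
        streak = 0  # consecutive losses immediately before the current round
        for won in results:
            # A round preceded by `streak` losses counts for every k up to that streak.
            for k in range(1, min(streak, max_streak) + 1):
                total[k] += 1
                wins[k] += int(won)
            streak = 0 if won else streak + 1
    return {k: wins[k] / total[k] for k in sorted(total)}


# One toy tournament: loss, loss, win, loss, win.
probs = win_prob_after_losing_streak([[False, False, True, False, True]])
```

In the toy sequence, three rounds follow at least one loss and two of them are wins, while the single round that follows two losses is a win.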

**Figure 6.** The total number of created files scales almost linearly with the round. R refers to the filename redundancy at round 15; high values indicate repeating patterns in filenames (such as main1.py, main2.py, . . . ). [Plot residue from an adjacent bar chart: Throwaway Files per Tournament (axis 0.0 to 20.0), one bar per model; values in extraction order 5.1, 2.4, 7.5, 1.3, 0.8, 2.9, 3.2, 1…]
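
The caption does not define the redundancy score R precisely. As a rough illustration of the idea, the sketch below treats two filenames as redundant when they are identical once digits are removed from the stem; this definition and the name `filename_redundancy` are assumptions made here, not the paper's metric.

```python
import re
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable


def filename_redundancy(paths: Iterable[str]) -> float:
    """Fraction of files whose digit-stripped name is shared with at least one other file."""
    keys = []
    for p in paths:
        path = PurePosixPath(p)
        stem = re.sub(r"\d+", "", path.stem)  # main1 -> main, main2 -> main
        keys.append((stem, path.suffix))
    if not keys:
        return 0.0
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] > 1) / len(keys)


score = filename_redundancy(["main1.py", "main2.py", "main3.py", "README.md"])
```

Three of the four toy files collapse to the same pattern, so the score is 0.75.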

**Figure 8.** LMs struggle to analyze log files from previous rounds and frequently hallucinate…

**Figure 9.** Technical overview of a CodeClash round. Each round, during the…

**Figure 10.** Battlecode 2025: Chromatic Conflict screen capture. The goal is to control a team of robobunnies to paint 70% of a map. Starter stub shown with the figure (truncated in the source):

    import random
    from battlecode25.stubs import *
    turn_count, directions = 0, [...]  # 8 directions

    def turn():
        # MUST be defined. This is called every turn and should contain core logic
        ...

    def run_tower():
        # Logic for a tower unit.
        ...

    def run_soldier():
        # Log…

**Figure 12.** Battlesnake screen capture. Your code controls a snake that should find food, avoid other snakes, and survive. Starter stub shown with the figure (truncated in the source):

    def info():
        return {"author": "", "color": "#888888"}  # further fields elided in the source

    def start(game_state):
        ...

    def end(game_state):
        ...

    def move(game_state):
        # determine safe move; prevent moving backwards, out of bounds, or into self/others; optionally move toward food
        …

**Figure 15.** This Core War program, called Dwarf, is a minimal attacking warrior. It repeatedly increments the pointer bmb (add.ab #4, bmb), copies the dat instruction to that location (mov.i bmb, bmb), and then loops back (jmp start). The effect is that every fourth memory cell in the core is overwritten with a dat “bomb”, gradually scattering lethal instructions that kill an opponent’s processes if executed…
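
The bombing pattern described in the Dwarf caption can be mimicked in a few lines. This sketch models only which cells get bombed, not Redcode semantics or process scheduling; the function name, the `start` offset, and the default core size of 8000 cells are assumptions for illustration.

```python
from typing import List


def dwarf_bombed_cells(
    core_size: int = 8000, step: int = 4, cycles: int = 12, start: int = 0
) -> List[int]:
    """Return the core addresses bombed over `cycles` iterations of the Dwarf loop."""
    target = start
    bombed = []
    for _ in range(cycles):
        target = (target + step) % core_size  # mirrors: add.ab #4, bmb
        bombed.append(target)  # mirrors: the mov.i that drops a dat bomb
        # jmp start: the loop simply repeats
    return bombed


first_three = dwarf_bombed_cells(cycles=3)
wrapped = dwarf_bombed_cells(core_size=8, cycles=3)
```

Because the target wraps around the circular core, a core size divisible by the step means the same set of cells is revisited on every pass.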

**Figure 17.** Example Halite bot implementation in C. Bots follow a game loop structure: receive the current game state (GetFrame), iterate over owned cells to decide moves, and submit actions (SendFrame). Spilled paper text (What are effective strategies?): Effective strategies in Halite span three distinct phases. During the early game up until the bot makes contact with an opponent, an effective strategy is to capture neutral territory …

**Figure 19.** A poker bot subclasses Bot and implements lifecycle hooks. These functions define how the bot initializes, chooses actions during play, and responds at the end of each round and game. Spilled paper text (Isn’t poker solved already?): Poker has served as a long-standing sandbox for researching superhuman-level AI systems. Simple, constrained variants of poker, such as Heads-Up [No-]Limit Texas Hold’em (2 players, fixed bet siz…

**Figure 20.** RoboCode screen capture. Your code controls a tank that should outmaneuver and outgun opposing tanks. Starter stub shown with the figure:

    package custom;

    import robocode.Robot;
    import robocode.ScannedRobotEvent;

    public class MyTank extends Robot {
        public void run() {
            // main loop: move + scan
            ...
        }

        public void onScannedRobot(ScannedRobotEvent e) {
            // respond to scanned robot
            ...
        }
    }

**Figure 22.** RobotRumble screen capture. Your code controls a tank that should outmaneuver and outgun opposing tanks. Starter stub shown with the figure (truncated in the source):

    def robot(state, unit):
        # Decide what this unit should do on its turn.
        # Possible actions include:
        # - Moving in one of the cardinal directions
        # - Attacking in a direction
        # - Gathering or interacting with resources
        # - Defending or waiting (no-op)
        # The decision can depend on…

**Figure 24.** Distribution of round scores by game. Spilled paper text:
1. The model is the only one with a valid submission (for example because the other model’s submission does not compile or execute).
2. The model scores higher than all others.
Scores are typically either win rates (across all repetitions of the arena), or other aggregate quantities (e.g., total amount of money won in poker). Distributions of round scores for different …
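
The two numbered conditions in the Figure 24 excerpt read like the criteria for winning a round. Assuming that reading, a round winner could be determined as sketched below; the function name, the use of `None` for an invalid submission, and the handling of ties and all-invalid rounds are assumptions introduced here.

```python
from typing import Dict, Optional


def round_winner(scores: Dict[str, Optional[float]]) -> Optional[str]:
    """scores maps player -> round score, or None when the submission was invalid."""
    valid = {p: s for p, s in scores.items() if s is not None}
    if not valid:
        return None  # assumption: no valid submission means no winner
    if len(valid) == 1:
        return next(iter(valid))  # condition 1: the only valid submission wins
    best = max(valid.values())
    leaders = [p for p, s in valid.items() if s == best]
    # Condition 2: the player must score strictly higher than all others.
    # Assumption: an exact tie at the top yields no winner.
    return leaders[0] if len(leaders) == 1 else None


only_valid = round_winner({"model_a": None, "model_b": 0.1})
higher_score = round_winner({"model_a": 0.6, "model_b": 0.4})
```

The scores here stand in for whatever aggregate the arena reports, such as a win rate across repetitions.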

**Figure 25.** Distribution of the number of rounds won by the players across arenas. The…

**Figure 26.** Log likelihood profiles for a fit to all arenas results.

**Figure 27.** Distribution of Elo scores from non-parametric and parametric bootstrapping
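
To make the bootstrapping idea concrete, here is a minimal two-player sketch: it converts a head-to-head win rate into an Elo gap under the standard logistic Elo model, then resamples outcomes with replacement to get a distribution of gaps. It is not the paper's fitting procedure, which covers eight models and several arenas; the function names are hypothetical, and the outcomes are assumed to be independent per-tournament results.

```python
import math
import random
from typing import List


def elo_gap(win_rate: float) -> float:
    """Elo difference implied by a head-to-head win rate under the logistic Elo model."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


def bootstrap_elo_gap(outcomes: List[int], n_boot: int = 1000, seed: int = 0) -> List[float]:
    """Non-parametric bootstrap: resample outcomes with replacement, recompute the gap."""
    rng = random.Random(seed)
    n = len(outcomes)
    gaps = []
    for _ in range(n_boot):
        sample = [rng.choice(outcomes) for _ in range(n)]
        rate = sum(sample) / n
        # Clip so that an all-win or all-loss resample does not give an infinite gap.
        rate = min(max(rate, 0.5 / n), 1.0 - 0.5 / n)
        gaps.append(elo_gap(rate))
    return gaps


# Toy data: player A wins 30 of 40 tournaments against player B (1 = win).
gaps = bootstrap_elo_gap([1] * 30 + [0] * 10)
```

A parametric bootstrap would instead simulate outcomes from the fitted win probability rather than resampling the observed ones.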

**Figure 28.** Elo-based ranks from non-parametric and parametric bootstrapping

**Figure 29.** CDF of files edited per round by each model. While some models typically never edit more than 5 files (o3, Gemini 2.5 Pro), others tend to create and manipulate many more (Claude Sonnet 4.5, GPT-5). [Plot residue from an adjacent figure: x-axis Round (1 to 15), y-axis Average Lines Changed (0 to 500).]

**Figure 31.** Average lines changed per round per model for README_agent.md, a file we suggest agents write important information to. The Anthropic family of models writes copious notes; other models tend to add briefer summaries. [Plot residue from an adjacent figure: x-axis Round (1 to 15), y-axis range 0 to 400, axis label not recovered.]

**Figure 33.** CDF of number of steps taken per round per model. The Anthropic family of models, along with Qwen3-Coder, usually consume more of the allotted step budget. [Plot residue from an adjacent figure: x-axis Round (1 to 15), y-axis range 5 to 30, axis label not recovered.]

**Figure 35.** CDF of thought length (in words) per model. The thought lengths are computed per model response. Our calculation does not consider the action produced by the model within the same response. [Plot residue from an adjacent figure: x-axis Round (1 to 15), y-axis range 0 to 100, axis label not recovered.]

**Figure 37.** A heatmap of errant action rates for models in different arenas. “Errant” means the action resulted in returncode != 0. We find that malformed actions do not constitute a significant reason for why models might struggle in CodeClash. [Plot residue from an adjacent figure: x-axis Recovery Time (Steps) (1 to 4), y-axis P(Recovery takes > X steps) (0.0 to 1.0); series Claude Sonnet 4.5, Qwen3 Coder, o3, Gemini 2.5 Pro, GPT-5, GPT-5 Mini, Grok Code Fast, Claude So…]
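
Each cell of such a heatmap is a simple ratio. The sketch below assumes that an errant action is one whose command exited with a non-zero return code, which is the usual shell convention for failure; the function name, the record layout, and the model and arena labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def errant_rates(
    actions: Iterable[Tuple[str, str, int]]
) -> Dict[Tuple[str, str], float]:
    """Map (model, arena, returncode) records to an errant rate per (model, arena) cell."""
    errant: Dict[Tuple[str, str], int] = defaultdict(int)
    total: Dict[Tuple[str, str], int] = defaultdict(int)
    for model, arena, returncode in actions:
        total[(model, arena)] += 1
        errant[(model, arena)] += int(returncode != 0)  # non-zero exit = errant
    return {cell: errant[cell] / total[cell] for cell in total}


# Hypothetical records for illustration only.
rates = errant_rates(
    [
        ("model_a", "arena_x", 0),
        ("model_a", "arena_x", 1),
        ("model_a", "arena_x", 0),
        ("model_a", "arena_x", 127),
        ("model_b", "arena_x", 0),
    ]
)
```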

**Figure 39.** Lead change rate comparison. A “lead change” is defined as a round…

**Figure 40.** Win share comparison. We define “win share” as the percentage of total points…

**Figure 42.** TrueSkill ratings per model based on 20 tournaments of 6-player Core War. TrueSkill models each player’s skill as a Gaussian distribution with mean µ (skill estimate) and standard deviation σ (uncertainty). After each round, both parameters are updated based on match outcomes: winning increases µ while exceeding expectations, and σ decreases as the system gains confidence in the estimate. Final placeme…
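
The update described in the caption can be illustrated with the simplest TrueSkill case. The sketch below is a two-player, no-draw update using the standard closed-form expressions, with no draw margin and no skill dynamics; it is not the 6-player procedure behind the figure. The function name is hypothetical, and the defaults (µ = 25, σ = 25/3, β = 25/6) are commonly used values assumed here.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple

_STD_NORMAL = NormalDist()  # standard normal distribution for pdf and cdf

Rating = Tuple[float, float]  # (mu, sigma)


def update_two_player(
    winner: Rating, loser: Rating, beta: float = 25 / 6
) -> Tuple[Rating, Rating]:
    """Simplified two-player, no-draw TrueSkill-style update."""
    mu_w, sig_w = winner
    mu_l, sig_l = loser
    c = sqrt(2 * beta**2 + sig_w**2 + sig_l**2)
    t = (mu_w - mu_l) / c
    # v is large when the win was unexpected (t strongly negative).
    # Note: for extremely negative t the cdf can underflow to zero.
    v = _STD_NORMAL.pdf(t) / _STD_NORMAL.cdf(t)
    w = v * (v + t)  # shrink factor for the variances, between 0 and 1
    new_winner = (mu_w + sig_w**2 / c * v, sig_w * sqrt(1 - sig_w**2 / c**2 * w))
    new_loser = (mu_l - sig_l**2 / c * v, sig_l * sqrt(1 - sig_l**2 / c**2 * w))
    return new_winner, new_loser


start = (25.0, 25 / 3)
(win_mu, win_sigma), (lose_mu, lose_sigma) = update_two_player(start, start)
```

With two equally rated players, the winner's µ rises and the loser's falls by the same amount, and both σ values shrink.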

**Figure 43.** Results for the groundedness of edits, hallucinated loss causality, and validation…

**Figure 44.** Results for the groundedness of edits, hallucinated loss causality, and validation…

**Figure 45.** Models perform different kinds of edits on the main player file as the tournament…

**Figure 46.** What do models spend their turns on? The mean number of actions a model…

**Figure 47.** RobotRumble leaderboard screen capture as of October 31, 2025. We evaluate…

**Figure 48.** Code similarity of models’ codebases with respect to each opponent for round 1 of BattleSnake (10 samples each). [Heatmap residue: rows and columns are Claude Sonnet 4, Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-5, GPT-5 Mini, Grok Code Fast, o3, Qwen3 Coder; the cell values are truncated in the extraction.]
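
The caption does not say how code similarity is computed. As one plausible stand-in, the sketch below uses the ratio from `difflib.SequenceMatcher` in the Python standard library and averages it over all pairs of sampled sources; both the choice of metric and the helper names are assumptions, not the paper's method.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from typing import Sequence


def code_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two source strings (1.0 means identical)."""
    return SequenceMatcher(None, a, b).ratio()


def mean_pairwise_similarity(sources: Sequence[str]) -> float:
    """Average similarity over all unordered pairs of sources."""
    pairs = list(combinations(sources, 2))
    if not pairs:
        return 1.0  # assumption: a single sample is trivially similar to itself
    return mean(code_similarity(a, b) for a, b in pairs)


identical = mean_pairwise_similarity(["x = 1", "x = 1", "x = 1"])
partial = code_similarity("abcd", "abcf")
```

A token-level or AST-level comparison would be less sensitive to formatting differences than this character-level ratio.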

**Figure 50.** Scatter plot of file reuse ratio and root level clutter with error bars. The top left quadrant represents most desirable practices (high file reuse, low root level clutter). [Plot residue from an adjacent figure: x-axis Round (2 to 14), y-axis Filename Redundancy Ratio (0.0 to 0.6); series Claude Sonnet 4, Claude Sonnet 4.5, Gemini 2.5 Pro, GPT-5, GPT-5 Mini, Grok Code Fast, o3, Qwen3 Coder.]

**Figure 52.** Cumulative probability density function of the number of files created during a tournament. While Claude Sonnet 4.5 consistently creates more files than the other models, GPT-5 reaches a high average number of created files because of an extreme number of output files in the CoreWar arena that are not cleaned up. Spilled paper text: As discussed in the main results, we notice that codebases tend to follow this trend of cre…

**Figure 53.** Screenshot of the 52 files created by Claude Sonnet 4.5 by the 15th round of a BattleSnake tournament. Several files are created for the purpose of notes, analyses, unit testing, and backups of the main bot. Spilled paper text: Sonnet 4.5 creates 13 files with the prefix “analyze ”. From manual inspection, we found that most of these implementations are doing the same thing, with only the log file path being different. The same tre…

Original abstract

Current benchmarks for coding evaluate language models (LMs) on concrete, well-specified tasks such as fixing specific bugs or writing targeted tests. However, human programmers do not spend all day incessantly addressing isolated tasks. Instead, real-world software development is grounded in the pursuit of high-level goals, like improving user retention or reducing costs. Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge. To address this, we introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives like score maximization, resource acquisition, or survival. Whether it's writing notes, scrutinizing documentation, analyzing competition logs, or creating test suites, models must decide for themselves how to improve their codebases both absolutely and against their opponents. We run 1680 tournaments (25,200 rounds total) to evaluate 8 LMs across 6 arenas. Our results reveal that while models exhibit diverse development styles, they share fundamental limitations in strategic reasoning. Models also struggle with long-term codebase maintenance, as repositories become progressively messy and redundant. These limitations are stark: top models lose every round against expert human programmers. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CodeClash sets up multi-round code tournaments for open-ended goals and flags model weaknesses in strategy and maintenance, but the human baseline comparison lacks enough protocol detail to fully support the strongest claims.

The main thing to know is that this paper introduces CodeClash as a benchmark where models run multi-round tournaments, editing codebases to pursue competitive objectives like score maximization or survival in head-to-head arenas. This shifts from the usual single-task coding tests toward something closer to real goal-driven development without step-by-step instructions. They ran 1680 tournaments and 25200 rounds across eight models and six arenas, which gives decent empirical coverage of different development styles and shared problems like weak strategic reasoning and progressive codebase messiness. That part lands as useful new data on how models handle long-term maintenance in competitive settings. The open-sourcing of the benchmark is also a plus for anyone who wants to build on it or check the arenas themselves.

The human comparison stands out as the starkest result, with top models losing every round to expert programmers. This highlights potential limits in autonomous code work, but the abstract and setup description do not spell out the human protocol in parallel terms. It is not clear whether humans used the same editing interface, had identical time budgets per round, or accessed the same competition logs. If the conditions differed, the gap could reflect setup asymmetry more than intrinsic model shortcomings in strategy or maintenance. That is a real soft spot for the central claim, though not necessarily fatal if the full paper clarifies matching conditions.

The work targets researchers working on AI agents for software engineering who need benchmarks beyond scripted tasks. Readers interested in agentic systems and long-horizon code evolution will get the most from the tournament format and the observed failure modes. The empirical approach avoids circularity and sticks to direct competitions rather than fitted parameters.

I would send this to peer review. The scale and the new evaluation angle make it worth referee time, with the main request being tighter specification of the human arm so the performance gap can be interpreted cleanly.

Referee Report

1 major / 1 minor

Summary. The paper introduces CodeClash, a benchmark in which language models compete in multi-round tournaments to iteratively develop codebases that achieve open-ended competitive objectives (e.g., score maximization, resource acquisition, survival) across six arenas. Agents edit code and then compete head-to-head in an arena evaluator; the study runs 1680 tournaments (25,200 rounds total) on eight LMs, reports diverse development styles together with limitations in strategic reasoning and long-term maintenance, and states that top models lose every round to expert human programmers.

Significance. If the central empirical claims hold under equivalent conditions, the work supplies a useful step beyond isolated-task coding benchmarks toward evaluating autonomous, goal-directed software engineering. The scale of 1680 tournaments and 25,200 rounds supplies substantial empirical coverage, and the open-sourcing of the benchmark supports reproducibility and follow-on research.

major comments (1)

[Abstract] The claim that 'top models lose every round against expert human programmers' is load-bearing for the paper's conclusions on intrinsic model limitations, yet the human participation protocol (code-editing interface, per-round time budgets, access to competition logs, and external tooling) is not specified in parallel with the model-agent description. Without this, the performance gap cannot be unambiguously attributed to strategic or maintenance shortcomings rather than setup asymmetry.

minor comments (1)

[Abstract] Implementation details of arena scoring mechanics and how winners are determined from objectives are left unspecified, limiting assessment of whether the competitive proxy faithfully captures the intended high-level goals.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment regarding the human participation protocol below and will incorporate the requested details into the revised manuscript.

Referee: [Abstract] The claim that 'top models lose every round against expert human programmers' is load-bearing for the paper's conclusions on intrinsic model limitations, yet the human participation protocol (code-editing interface, per-round time budgets, access to competition logs, and external tooling) is not specified in parallel with the model-agent description. Without this, the performance gap cannot be unambiguously attributed to strategic or maintenance shortcomings rather than setup asymmetry.

Authors: We agree that the human baseline protocol requires explicit, parallel specification to support the claim and to enable readers to evaluate whether the performance gap stems from model limitations or experimental asymmetry. In the revised manuscript we will add a dedicated subsection to the Experimental Setup (Section 4) that mirrors the model-agent description. This subsection will detail: the code-editing interface (a browser-based IDE with file tree navigation, syntax highlighting, and in-place editing, identical in functionality to the agent environment); per-round time budgets (20 minutes of active editing time plus 5 minutes for review and submission, calibrated to exceed typical model inference latency); access to competition logs (full round histories, opponent codebases, and arena evaluation outputs provided at the start of each editing phase); and external tooling (standard language documentation, local test runners, and basic IDE features, with explicit prohibition of external AI assistants). Humans received the same high-level objective statements as the agents and no additional strategic guidance. These additions will be placed immediately after the model-agent protocol description to facilitate direct comparison. We believe the revision will strengthen the attribution of observed limitations while preserving the empirical findings.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are self-contained

The paper introduces CodeClash as a new benchmark and reports direct empirical outcomes from 1680 tournaments (25,200 rounds) evaluating 8 LMs against each other and expert humans across 6 arenas. No mathematical derivations, equations, fitted parameters presented as predictions, or load-bearing self-citations exist. Central claims rest on observable competition results in code arenas rather than any reduction to inputs by construction, making the evaluation independent and self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper contributes a new empirical benchmark and observations from running it; it introduces no new mathematical entities or heavily fitted parameters beyond standard experimental choices such as tournament count and arena definitions.

axioms (1)

domain assumption: The selected objectives (score maximization, resource acquisition, survival) and arena rules constitute representative tests of goal-oriented software engineering without explicit guidance. This premise defines the evaluation arenas and is invoked to interpret model performance as evidence of real-world limitations.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: “We introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena...”

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: “Models also struggle with long-term codebase maintenance, as repositories become progressively messy and redundant.”

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 1 internal anchor

[1]

URL https://arxiv.org/abs/2310.06770. D.G. Jones and A.K. Dewdney. Core wars guidelines, 1984. URL https://corewar.co.uk/ standards/cwg.txt. Seth Karten, Andy Luu Nguyen, and Chi Jin. Pok´echamp: an expert-level minimax language agent, 2025. URL https://arxiv.org/abs/2503.04094. Bhavesh Kumar, Hoang Nguyen, and Roger Jin. Husky hold’em bench. https:// hus...

work page internal anchor Pith review Pith/arXiv arXiv 1984
[2]

Execution is crucial to enable models to create and use their own constructs (e.g., analysis scripts, memory systems)

LMs should be able to view execution feedback. Execution is crucial to enable models to create and use their own constructs (e.g., analysis scripts, memory systems)

work page
[3]

A defining challenge of CodeClash is that LMs operate in a self-directed manner

LMs should be able to interact with a codebase. A defining challenge of CodeClash is that LMs operate in a self-directed manner. Workflow-oriented approaches (Xia et al., 2024) are unsuitable for our setting. Going hand-in-hand with (1), interaction is also necessary so that models can string sequences of changes together. 16 CodeClash: Benchmarking Goal-...

work page 2024
[4]

impartial

LMs should operate using bash actions, not tools. As described in Yang et al. (2024b), various workflows and tools can be (un-)intentionally biased to favor particular models. Our goal is to evaluate models, not scaffolds or tools. Therefore, we decide to make LMs operate in the most “impartial” action space. This decision also leaves an opportunity for L...

work page arXiv 2025
[5]

You write a single bash command

work page
[6]

The system executes that command in a subshell

work page
[7]

You write your next command For each of your response:

work page
[8]

Include a THOUGHT section explaining your reasoning and what you’re trying to accomplish

work page
[9]

Provide exactly ONE bash command to execute

work page
[10]

The action must be enclosed in triple backticks (see below for formatting rules)

work page
[11]

Every ac- tion is executed in a new subshell

Directory or environment variable changes are not persistent. Every ac- tion is executed in a new subshell. However, you can prefix any action with MY ENV VAR=MY VALUE cd /path/to/working/dir && ... or write/load environment variables from files Format your responses like this: <format example> THOUGHT: Here I explain my reasoning process, analysis of the...

work page 2024
[12]

The model is the only one with a valid submission (for example because the other model’s submission does not compile or execute)

work page
[13]

I have made all the changes I think are necessary. I will now conclude this round [END action]

The model scores higher than all others. Scores a typically either win rates (across all repetitions of the arena), or other aggregate quantities (e.g., total amount of money won in poker). Distributions of round scores for different arenas are shown in Figure 24. Because of the sequential nature of a tournament, the scores of the rounds are not independe...

work page 1952
[14]

Recovery time

We find that malformed actions does not constitute a significant reason for why mod- els might struggle in CodeClash. 1 2 3 4 Recovery Time (Steps) 0.0 0.2 0.4 0.6 0.8 1.0 P(Recovery takes > X steps) Claude Sonnet 4.5 Qwen3 Coder o3 Gemini 2.5 Pro GPT-5 GPT-5 Mini Grok Code Fast Claude Sonnet 4 Figure 38: “Recovery time” is the num- ber of steps between a...

work page 2025
[15]

What motivated the edits

work page
[16]

unknown

What steps were taken to validate the edits All questions that are marked as boolean need to be answered with a boolean value . You cannot answer " unknown " or similar . ## Definitions ** Main player file **: You are investigating an LM agent that is playing a game . The main player file is the main file that constitutes the agent 's submission , i . e ....

work page
[17]

Only comments , documentation , refactoring was performed

`none `: No change in behavior . Only comments , documentation , refactoring was performed

work page
[18]

`tweak `: Logic is left unchanged , but we do change some parameters

work page
[19]

`fix `: Small , targeted change with the intent to fix broken behavior

work page
[20]

` feature `: Significant new behavior is added , mostly extending the existing code

work page
[21]

`change `: We significantly change the behavior by rewriting significant logic of the code . Notes :

work page
[22]

Only count the final edits to the main player file ( any edits that are reverted are not counted )

work page
[23]

For this question , only the main player file is considered

work page
[24]

For feature or change , the order is not important , choose what better describes the changes

Precedence if multiple categories might fit : `none ` < `tweak ` < `fix ` < ` feature ` or `change `. For feature or change , the order is not important , choose what better describes the changes

work page
[25]

Ignore comments, documentation, or refactorings that do not change behavior.

## Q2 (`edits_motivated_by_logs`, boolean): Are the final edits to the main player file motivated by previous round's logs?

Are the **FINAL** (!) edits to the **MAIN PLAYER FILE (!)** of the player direc...

[26]

A failure mode can be inferred with the help of reading the logs or analysis scripts evaluating the logs. Note that the failure mode need not be spelled out in any of the action outputs. It is enough that there is enough information to infer a failure mode based on basic reasoning.

[27]

The edit is directly related to this failure mode. It is ok if some minor parts of the edit are unrelated. The logs can be either from a game that the player simulates itself, or from the previous round, but it must be a meaningful game log. Here are some examples of real failure modes:

- The snake that the player is controlling runs out of food (s...

[28]

Player does not look at logs.

[29]

Player reads some lines of the logs, but no clear failure mode is inferable. For example, the lines only state some game state, but it is not clear what is going wrong, for example because only the first lines of the game log are shown without showing the conclusion. Or the logs only show which player won but without much of a reason.

[30]

Player runs a script that analyzes logs, but the analysis script does not return an actionable outcome or information that allows to infer it. For example, the analysis script only reports losses, without attribution of what went wrong.

[31]

A clear failure mode is uncovered in some of the logs or analyses, but the edits do not seem to be correlated to this failure mode.

## Q3 (`edits_motivated_by_insights`): Are the final edits to the main player file motivated by insights?

Can the goal of the **FINAL** (!) edits to the **MAIN PLAYER FILE (!)** be motivated by any insights based o...
[32]

The player wrote a meaningful test that revealed a problem (or a way to improve) and then performed the corresponding edit.

[33]

The player wrote a meaningful analysis script that revealed a problem (or a way to improve) and then performed the corresponding edit.

[34]

The player ran some test games that revealed a problem (or a way to improve) and then performed the corresponding edit.

[35]

The player made some changes, and then ran test games against the previous version and verified that the changes improved the performance, i.e., had a higher win rate.

However, if for 1. and 2. the test or analysis script gives a recommendation that's not corroborated by the actual code of the analysis or test file, or by its respective output,...
[36]

Old: Were not created during the trajectory, i.e., you do not see how they were created.

[37]

Static: Are always shown and do not depend on any tests or analysis outcomes. A common case is generic notes in `README_agent.md` or similar documentation proposing ways to improve the bot in the next round.

This question is independent of the previous questions (`edits_motivated_by_logs`, `edits_motivated_by_insights`): The final edits can b...

[38]

Unit tests showed that the edits introduced issues.

[39]

Simulations showed that the edits introduced issues or had a lower win rate.

Do not consider edits that failed because of incorrect usage of the edit tools or other problems that caused the edits to not take effect at all.

## Q6 (`edits_tested_with_simulations`): Are the final edits to the main player file tested with simulations of the game?

Are the...
[40]

If the games failed to run, or showed that the new version was clearly worse than the previous version, answer False.

[41]

If it was not verified who won the games, also answer False.

[42]

Unit tests do NOT (!) count as a simulated game.

[43]

The validation by simulation does not have to take place at the very end, but it has to be played with the updated version of the main player file that includes the core implementation of the idea of the final edits. It is acceptable to have some minor edits performed after the simulation, a...
[44]

Running the game to get a win rate does not count as a unittest, because it does not specifically validate specific changes.

[45]

Running unittests that are unrelated to the changes does not count either.

[46]

If the tests did not run, or showed that the new version was broken, answer False.

[47]

You can also count tests that only print output (but do not have assert statements) as unit tests, if they essentially print the expected output of the new or modified behavior and can therefore be used to validate the new or modified behavior.

[48]

The validation by unittests does not have to take place at the very end, but it has to be performed with the updated version of the main player file that includes the core implementation of the idea of the final edits. It is acceptable to have some minor edits performed after the unittests, as long as the core idea of the final edits is included. Spec...
[49]

An additional test was added to a test script or unittest framework.

[50]

The analysis script was improved to look for a new behavior or failure mode.

[51]

A script to help running simulated games and to parse the results.

The following are examples of non-significant improvements:

[52]

Static messages or comments are added to the test or analysis framework (e.g., generic improvement notes that are independent of actual observations).

[53]

Documentation of the tests or analysis scripts.

[54]

Analysis or test scripts that are specific to the current round and are not expected to be useful for the next round.

Notes:
[55]

If a test or analysis is executed without being saved to disk, it does not count as an improvement (i.e., `python -c` calls, shell one-liners, etc.).

[56]

If a test or analysis script is removed after being executed, it does not count.

[57]

This question is completely independent of the main player file and all other questions.

## Output format

Answer in the JSON format specified. The `reasoning` field should contain an explanation for your answer that explains your reasoning for each of the answers. Include general statements/observations first, then write down your reasoning for ea...
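As a rough illustration of the output requirements, the sketch below checks a judge's JSON answer. The full schema is not shown in this excerpt, so the check covers only two fields that the rubric names and treats as true/false (`edits_motivated_by_logs`, `edits_tested_with_simulations`) plus the `reasoning` field; the function name `validate_answer` is hypothetical.

```python
import json

# Subset of fields named in the rubric; the complete schema is an assumption
# and is not reproduced here.
BOOLEAN_FIELDS = ("edits_motivated_by_logs", "edits_tested_with_simulations")


def validate_answer(raw: str) -> dict:
    """Parse a JSON answer and reject non-boolean or missing values."""
    answer = json.loads(raw)
    for field in BOOLEAN_FIELDS:
        value = answer.get(field)
        # Strings such as "unknown" are not acceptable for boolean questions.
        if not isinstance(value, bool):
            raise ValueError(f"{field} must be true or false, got {value!r}")
    reasoning = answer.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("reasoning must be a non-empty string")
    return answer
```

A strict type check is used rather than truthiness so that an answer like `"unknown"` or `1` fails instead of silently counting as true.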
[58]

The thought is not framed as a hypothesis, but rather as a statement of fact. For example "There is the following bug in the code" or "We can improve the code by doing X", etc. Do not include thoughts that are framed as future actions, e.g., "I will now do X".

[59]

The statement of fact is concrete.

[60]

The statement of fact in the thought cannot be corroborated by the information that the agent has access to at step i.

[61]

The agent also cannot come to the conclusion by common sense knowledge and reasoning about the information that the agent has access to at step i.

[62]

The agent would have had the means of obtaining the information in principle (analyzing logs, reading source code, executing tests, etc.).

[63]

The incident, i.e., the uncorroborated and potentially incorrect statement of fact, is relevant to the overall trajectory and the objective of the agent, i.e., the final goal of the agent winning the game. In other words, the potentially incorrect statement of fact might have reduced the agent's chances of winning the game.

### Examples of th...
[64]

You MUST (!) categorize EVERY (!) action. Do NOT (!) skip any action.

[65]

Every action MUST (!) be put into exactly (!) one (!) category.

[66]

Your category MUST (!) be one of the list above.
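The three constraints above (no skipped action, exactly one category per action, category taken from the allowed list) lend themselves to a mechanical check. The following is a minimal sketch under stated assumptions: the category list itself is not shown in this excerpt, so the `allowed` set is supplied by the caller, and the function name `check_categorization` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def check_categorization(num_actions: int,
                         labels: Iterable[Tuple[int, str]],
                         allowed: Set[str]) -> List[str]:
    """Return a list of rule violations for a set of (step, category) labels."""
    labels = list(labels)
    problems = []
    counts = Counter(step for step, _ in labels)
    for step in range(num_actions):
        if counts[step] == 0:
            problems.append(f"action {step} was skipped")
        elif counts[step] > 1:
            problems.append(f"action {step} has {counts[step]} categories")
    for step, category in labels:
        if category not in allowed:
            problems.append(f"action {step} uses unknown category {category!r}")
    return problems
```

An empty result means the labeling satisfies all three constraints; each entry otherwise names the offending action, which makes it easy to re-prompt for only the missing or invalid labels.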
[67]

If you are unsure, use the best match for the category.

In Figure 46, read combines the navigation, search, and read operations.

Claude Sonnet 4.5 loses to a static solution written by a human expert. As discussed in Section 4.1, we run 10 tournaments of Claude Sonnet 4.5, the top model on the RobotRumble arena, against the top open-source submission...

[1]

URL https://arxiv.org/abs/2310.06770.

D.G. Jones and A.K. Dewdney. Core wars guidelines, 1984. URL https://corewar.co.uk/standards/cwg.txt.

Seth Karten, Andy Luu Nguyen, and Chi Jin. Pokéchamp: an expert-level minimax language agent, 2025. URL https://arxiv.org/abs/2503.04094.

Bhavesh Kumar, Hoang Nguyen, and Roger Jin. Husky hold’em bench. https:// hus...

[2]

LMs should be able to view execution feedback. Execution is crucial to enable models to create and use their own constructs (e.g., analysis scripts, memory systems).

[3]

LMs should be able to interact with a codebase. A defining challenge of CodeClash is that LMs operate in a self-directed manner. Workflow-oriented approaches (Xia et al., 2024) are unsuitable for our setting. Going hand-in-hand with (1), interaction is also necessary so that models can string sequences of changes together. ...

[4]

LMs should operate using bash actions, not tools. As described in Yang et al. (2024b), various workflows and tools can be (un-)intentionally biased to favor particular models. Our goal is to evaluate models, not scaffolds or tools. Therefore, we decide to make LMs operate in the most “impartial” action space. This decision also leaves an opportunity for L...

[5]

You write a single bash command.

[6]

The system executes that command in a subshell.

[7]

You write your next command.

For each of your responses:

[8]

Include a THOUGHT section explaining your reasoning and what you’re trying to accomplish.

[9]

Provide exactly ONE bash command to execute.

[10]

The action must be enclosed in triple backticks (see below for formatting rules).

[11]

Directory or environment variable changes are not persistent. Every action is executed in a new subshell. However, you can prefix any action with `MY_ENV_VAR=MY_VALUE cd /path/to/working/dir && ...` or write/load environment variables from files.

Format your responses like this:

<format example> THOUGHT: Here I explain my reasoning process, analysis of the...
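To make the interaction rules concrete, here is a minimal sketch of a harness step that extracts the single fenced action from a response and runs it in a fresh shell. This is not the paper's scaffold: the helper names `extract_action` and `run_action` are hypothetical, the system default shell stands in for bash, and the fence-matching pattern is an assumption about the response format.

```python
import re
import subprocess

# Built programmatically so this listing does not contain a nested fence.
FENCE = "`" * 3
ACTION_PATTERN = re.compile(FENCE + r"(?:bash)?\s*\n(.*?)\n" + FENCE, re.DOTALL)


def extract_action(response: str) -> str:
    """Return the one fenced command in a response, or raise ValueError."""
    blocks = ACTION_PATTERN.findall(response)
    if len(blocks) != 1:
        raise ValueError(f"expected exactly one action, found {len(blocks)}")
    return blocks[0].strip()


def run_action(command: str) -> str:
    """Run a command in a new shell process; no state survives the call."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout.strip()
```

Because each call to `run_action` starts a new process, an `export` or `cd` in one action has no effect on the next one; state has to be carried by prefixing the command or by writing it to a file, as the prompt text describes.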

[12]

The model is the only one with a valid submission (for example because the other model’s submission does not compile or execute).

[13]

The model scores higher than all others. Scores are typically either win rates (across all repetitions of the arena), or other aggregate quantities (e.g., total amount of money won in poker). Distributions of round scores for different arenas are shown in Figure 24. Because of the sequential nature of a tournament, the scores of the rounds are not independe...
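Read together, items [12] and [13] look like the two conditions under which a model is credited with a round. A minimal sketch of that rule follows; the function name `round_winner`, the use of `None` to mark an invalid submission, and the choice to declare no winner on an exact tie are all assumptions, since the excerpt does not say how ties are handled.

```python
from typing import Dict, Optional


def round_winner(scores: Dict[str, Optional[float]]) -> Optional[str]:
    """Return the winning player, or None if there is no clear winner.

    `scores` maps each player to its round score, with None marking a
    submission that did not compile or execute (hypothetical encoding).
    """
    valid = {player: s for player, s in scores.items() if s is not None}
    if not valid:
        return None
    if len(valid) == 1:
        # Condition [12]: the only valid submission wins by default.
        return next(iter(valid))
    best = max(valid.values())
    leaders = [player for player, s in valid.items() if s == best]
    # Condition [13]: the winner must score strictly higher than all others.
    return leaders[0] if len(leaders) == 1 else None
```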

[14]

We find that malformed actions do not constitute a significant reason why models might struggle in CodeClash.

[Figure 38 plot. x-axis: Recovery Time (Steps); y-axis: P(Recovery takes > X steps); series: Claude Sonnet 4.5, Qwen3 Coder, o3, Gemini 2.5 Pro, GPT-5, GPT-5 Mini, Grok Code Fast, Claude Sonnet 4.]

Figure 38: “Recovery time” is the number of steps between a...

[15]

What motivated the edits

[16]

What steps were taken to validate the edits

All questions that are marked as boolean need to be answered with a boolean value. You cannot answer "unknown" or similar.

## Definitions

**Main player file**: You are investigating an LM agent that is playing a game. The main player file is the main file that constitutes the agent's submission, i.e....

[17]

`none`: No change in behavior. Only comments, documentation, refactoring was performed.

[18]

`tweak`: Logic is left unchanged, but we do change some parameters.

[19]

`fix`: Small, targeted change with the intent to fix broken behavior.

[20]

`feature`: Significant new behavior is added, mostly extending the existing code.

[21]

`change`: We significantly change the behavior by rewriting significant logic of the code.

Notes:
